* [PATCH v3 01/26] fs/resctrl: Simplify allocation of mon_data structures
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:13 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 02/26] fs-x86/resctrl: Prepare for more monitor events Tony Luck
` (26 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Instead of making a special case to allocate and attach these structures
to kernfs files in the default control group, simply allocate a structure
when a new combination of <rid, domain, mevt, do_sum> is needed and
reuse existing structures when possible.
Free all structures when the resctrl filesystem is unmounted.
Partial revert of commit fa563b5171e9 ("x86/resctrl: Expand the width
of dom_id by replacing mon_data_bits")
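The find-or-allocate scheme described above can be sketched in plain userspace C (a hand-rolled singly linked list stands in for the kernel's list_head; kn_priv_list and the field names mirror the patch, but everything here is illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for struct mon_data; not the kernel definition. */
struct mon_data {
	struct mon_data *next;
	int rid, domid, sum, evtid;
};

/* One list of all allocated structures, as in the patch. */
static struct mon_data *kn_priv_list;

/*
 * Return the existing entry with the same <rid, domid, evtid, do_sum>
 * tuple if one was already created for another directory, otherwise
 * allocate, fill and enqueue a new one.
 */
struct mon_data *mon_get_kn_priv(int rid, int domid, int evtid, int do_sum)
{
	struct mon_data *priv;

	for (priv = kn_priv_list; priv; priv = priv->next) {
		if (priv->rid == rid && priv->domid == domid &&
		    priv->sum == do_sum && priv->evtid == evtid)
			return priv;
	}

	priv = calloc(1, sizeof(*priv));
	if (!priv)
		return NULL;
	priv->rid = rid;
	priv->domid = domid;
	priv->sum = do_sum;
	priv->evtid = evtid;
	priv->next = kn_priv_list;
	kn_priv_list = priv;

	return priv;
}
```

A second lookup with the same tuple returns the same pointer, which is what lets event files in every monitor group share one priv structure.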
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 2 +
fs/resctrl/rdtgroup.c | 138 ++++++++++++------------------------------
2 files changed, 40 insertions(+), 100 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index ec3863d18f68..e5976bd52a35 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -83,6 +83,7 @@ struct mon_evt {
/**
* struct mon_data - Monitoring details for each event file.
+ * @list: List of all allocated structures.
* @rid: Resource id associated with the event file.
* @evtid: Event id associated with the event file.
* @sum: Set when event must be summed across multiple
@@ -96,6 +97,7 @@ struct mon_evt {
* rdtgroup_mutex.
*/
struct mon_data {
+ struct list_head list;
unsigned int rid;
enum resctrl_event_id evtid;
unsigned int sum;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 234ec9dbe5b3..338b70c7d302 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -69,6 +69,8 @@ static int rdtgroup_setup_root(struct rdt_fs_context *ctx);
static void rdtgroup_destroy_root(void);
+static void mon_put_kn_priv(void);
+
struct dentry *debugfs_resctrl;
/*
@@ -2873,6 +2875,7 @@ static void rdt_kill_sb(struct super_block *sb)
resctrl_arch_reset_all_ctrls(r);
rmdir_all_sub();
+ mon_put_kn_priv();
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
schemata_list_destroy();
@@ -2895,107 +2898,54 @@ static struct file_system_type rdt_fs_type = {
.kill_sb = rdt_kill_sb,
};
+static LIST_HEAD(kn_priv_list);
+
/**
- * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
- * the default control group.
+ * mon_get_kn_priv() - Get the mon_data priv data for this event
* Called when monitor event files are created for a domain.
- * When called with the default control group, the structure will be allocated.
- * This happens at mount time, before other control or monitor groups are
- * created.
- * This simplifies the lifetime management for rmdir() versus domain-offline
- * as the default control group lives forever, and only one group needs to be
- * special cased.
+ * The same values are used in multiple directories. Keep a list
+ * of allocated structures and reuse an existing one with the same
+ * list of values for rid, domain, etc.
*
- * @r: The resource for the event type being created.
- * @d: The domain for the event type being created.
- * @mevt: The event type being created.
- * @rdtgrp: The rdtgroup for which the monitor file is being created,
- * used to determine if this is the default control group.
- * @do_sum: Whether the SNC sub-numa node monitors are being created.
+ * @rid: The resource for the event type being created.
+ * @domid: The domain for the event type being created.
+ * @mevt: The event type being created.
+ * @do_sum: Whether the SNC sub-numa node monitors are being created.
*/
-static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
- struct rdt_mon_domain *d,
- struct mon_evt *mevt,
- struct rdtgroup *rdtgrp,
- bool do_sum)
+static struct mon_data *mon_get_kn_priv(int rid, int domid, struct mon_evt *mevt, bool do_sum)
{
- struct kernfs_node *kn_dom, *kn_evt;
struct mon_data *priv;
- bool snc_mode;
- char name[32];
- lockdep_assert_held(&rdtgroup_mutex);
-
- snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- if (!do_sum)
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
- else
- sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
-
- kn_dom = kernfs_find_and_get(kn_mondata, name);
- if (!kn_dom)
- return NULL;
-
- kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
-
- /* Is this the creation of the default groups monitor files? */
- if (!kn_evt && rdtgrp == &rdtgroup_default) {
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- if (!priv)
- return NULL;
- priv->rid = r->rid;
- priv->domid = do_sum ? d->ci->id : d->hdr.id;
- priv->sum = do_sum;
- priv->evtid = mevt->evtid;
- return priv;
+ list_for_each_entry(priv, &kn_priv_list, list) {
+ if (priv->rid == rid && priv->domid == domid &&
+ priv->sum == do_sum && priv->evtid == mevt->evtid)
+ return priv;
}
- if (!kn_evt)
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv)
return NULL;
- return kn_evt->priv;
+ priv->rid = rid;
+ priv->domid = domid;
+ priv->sum = do_sum;
+ priv->evtid = mevt->evtid;
+ list_add_tail(&priv->list, &kn_priv_list);
+
+ return priv;
}
/**
- * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
- * all events from the default control group.
- * Put the mon_data priv data for all events for a particular domain.
- * When called with the default control group, the priv structure previously
- * allocated will be kfree()d. This should only be done as part of taking a
- * domain offline.
- * Only a domain offline will 'rmdir' monitor files in the default control
- * group. After domain offline releases rdtgrp_mutex, all references will
- * have been removed.
- *
- * @rdtgrp: The rdtgroup for which the monitor files are being removed,
- * used to determine if this is the default control group.
- * @name: The name of the domain or SNC sub-numa domain which is being
- * taken offline.
+ * mon_put_kn_priv() - Free all allocated mon_data structures
+ * Called when resctrl file system is unmounted.
*/
-static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
+static void mon_put_kn_priv(void)
{
- struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct kernfs_node *kn_dom, *kn_evt;
- struct mon_evt *mevt;
+ struct mon_data *priv, *tmp;
- lockdep_assert_held(&rdtgroup_mutex);
-
- if (rdtgrp != &rdtgroup_default)
- return;
-
- kn_dom = kernfs_find_and_get(kn_mondata, name);
- if (!kn_dom)
- return;
-
- list_for_each_entry(mevt, &r->evt_list, list) {
- kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
- if (!kn_evt)
- continue;
- if (!kn_evt->priv)
- continue;
-
- kfree(kn_evt->priv);
- kn_evt->priv = NULL;
+ list_for_each_entry_safe(priv, tmp, &kn_priv_list, list) {
+ kfree(priv);
+ list_del(&priv->list);
}
}
@@ -3029,16 +2979,12 @@ static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subn
if (!kn)
return;
- mon_put_default_kn_priv_all(rdtgrp, name);
-
kernfs_put(kn);
- if (kn->dir.subdirs <= 1) {
+ if (kn->dir.subdirs <= 1)
kernfs_remove(kn);
- } else {
- mon_put_default_kn_priv_all(rdtgrp, subname);
+ else
kernfs_remove_by_name(kn, subname);
- }
}
/*
@@ -3081,7 +3027,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
return -EPERM;
list_for_each_entry(mevt, &r->evt_list, list) {
- priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
+ priv = mon_get_kn_priv(r->rid, do_sum ? d->ci->id : d->hdr.id, mevt, do_sum);
if (WARN_ON_ONCE(!priv))
return -EINVAL;
@@ -3165,17 +3111,9 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdtgroup *prgrp, *crgrp;
struct list_head *head;
- /*
- * During domain-online create the default control group first
- * so that mon_get_default_kn_priv() can find the allocated structure
- * on subsequent calls.
- */
- mkdir_mondata_subdir(kn_mondata, d, r, &rdtgroup_default);
-
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- if (prgrp != &rdtgroup_default)
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ mkdir_mondata_subdir(parent_kn, d, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
--
2.48.1
* Re: [PATCH v3 01/26] fs/resctrl: Simplify allocation of mon_data structures
2025-04-07 23:40 ` [PATCH v3 01/26] fs/resctrl: Simplify allocation of mon_data structures Tony Luck
@ 2025-04-18 21:13 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:13 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
...
> + list_for_each_entry_safe(priv, tmp, &kn_priv_list, list) {
> + kfree(priv);
> + list_del(&priv->list);
> }
> }
>
Did not look through this patch in detail considering its other version in the
arch/fs split but this caught my eye. Order should be switched.
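For illustration, the ordering issue is a use-after-free: list_del(&priv->list) reads priv->list after kfree(priv) has already released it. A minimal userspace sketch of the corrected teardown (a singly linked list instead of list_head; names mirror the patch but this is not the kernel code):

```c
#include <assert.h>
#include <stdlib.h>

struct mon_data {
	struct mon_data *next;
};

static struct mon_data *kn_priv_list;

/* Test helper: push one entry on the list. */
int mon_add_kn_priv(void)
{
	struct mon_data *priv = calloc(1, sizeof(*priv));

	if (!priv)
		return -1;
	priv->next = kn_priv_list;
	kn_priv_list = priv;
	return 0;
}

/* Corrected order: unlink the node first, free it second. */
void mon_put_kn_priv(void)
{
	struct mon_data *priv = kn_priv_list, *tmp;

	while (priv) {
		tmp = priv->next;   /* remember the successor */
		kn_priv_list = tmp; /* unlink (the list_del() step) */
		free(priv);         /* only now is freeing safe */
		priv = tmp;
	}
}
```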
Reinette
* [PATCH v3 02/26] fs-x86/resctrl: Prepare for more monitor events
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
2025-04-07 23:40 ` [PATCH v3 01/26] fs/resctrl: Simplify allocation of mon_data structures Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:17 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 03/26] fs/resctrl: Change how events are initialized Tony Luck
` (25 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
There's a rule in computer programming that objects appear zero,
once, or many times. So code accordingly.
There are two MBM events and resctrl is coded with a lot of
if (local)
do one thing
if (total)
do a different thing
Simplify the code by looping over whichever events are enabled.
Make rdt_mon_features a bitmap to allow for expansion.
Move resctrl_is_mbm_event() to <asm/resctrl.h> as it gets used by core.c
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 6 +--
include/linux/resctrl_types.h | 8 +++
arch/x86/include/asm/resctrl.h | 8 +--
arch/x86/kernel/cpu/resctrl/internal.h | 6 +--
fs/resctrl/internal.h | 4 ++
arch/x86/kernel/cpu/resctrl/core.c | 45 +++++++++--------
arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++------
fs/resctrl/ctrlmondata.c | 41 ++++-----------
fs/resctrl/monitor.c | 70 ++++++++++++++++----------
fs/resctrl/rdtgroup.c | 47 ++++++++---------
10 files changed, 133 insertions(+), 135 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 5c7c8bf2c47f..d6a926b6fc0e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,8 +151,7 @@ struct rdt_ctrl_domain {
* @hdr: common header for different domain types
* @ci: cache info for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
- * @mbm_total: saved state for MBM total bandwidth
- * @mbm_local: saved state for MBM local bandwidth
+ * @mbm_states: saved state for each QOS MBM event
* @mbm_over: worker to periodically read MBM h/w counters
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
@@ -162,8 +161,7 @@ struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
struct cacheinfo *ci;
unsigned long *rmid_busy_llc;
- struct mbm_state *mbm_total;
- struct mbm_state *mbm_local;
+ struct mbm_state *mbm_states[QOS_NUM_MBM_EVENTS];
struct delayed_work mbm_over;
struct delayed_work cqm_limbo;
int mbm_work_cpu;
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index a7faf2cd5406..898068a99ef7 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -55,5 +55,13 @@ enum resctrl_event_id {
};
#define QOS_NUM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID + 1)
+#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
+#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
+
+static inline bool resctrl_is_mbm_event(int e)
+{
+ return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
+ e <= QOS_L3_MBM_LOCAL_EVENT_ID);
+}
#endif /* __LINUX_RESCTRL_TYPES_H */
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 6eb7d5c94c7a..4346de48eeab 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -42,7 +42,7 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
extern bool rdt_alloc_capable;
extern bool rdt_mon_capable;
-extern unsigned int rdt_mon_features;
+extern DECLARE_BITMAP(rdt_mon_features, QOS_NUM_EVENTS);
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
@@ -84,17 +84,17 @@ static inline void resctrl_arch_disable_mon(void)
static inline bool resctrl_arch_is_llc_occupancy_enabled(void)
{
- return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID));
+ return test_bit(QOS_L3_OCCUP_EVENT_ID, rdt_mon_features);
}
static inline bool resctrl_arch_is_mbm_total_enabled(void)
{
- return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID));
+ return test_bit(QOS_L3_MBM_TOTAL_EVENT_ID, rdt_mon_features);
}
static inline bool resctrl_arch_is_mbm_local_enabled(void)
{
- return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID));
+ return test_bit(QOS_L3_MBM_LOCAL_EVENT_ID, rdt_mon_features);
}
/*
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 521db28efb3f..45eabc7919c6 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -59,15 +59,13 @@ struct rdt_hw_ctrl_domain {
* struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
* a resource for a monitor function
* @d_resctrl: Properties exposed to the resctrl file system
- * @arch_mbm_total: arch private state for MBM total bandwidth
- * @arch_mbm_local: arch private state for MBM local bandwidth
+ * @arch_mbm_states: arch private state for each MBM event
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
struct rdt_hw_mon_domain {
struct rdt_mon_domain d_resctrl;
- struct arch_mbm_state *arch_mbm_total;
- struct arch_mbm_state *arch_mbm_local;
+ struct arch_mbm_state *arch_mbm_states[QOS_NUM_MBM_EVENTS];
};
static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index e5976bd52a35..7a65ea02d442 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -386,6 +386,10 @@ bool closid_allocated(unsigned int closid);
int resctrl_find_cleanest_closid(void);
+int rdt_lookup_evtid_by_name(char *name);
+
+char *rdt_event_name(enum resctrl_event_id evt);
+
#ifdef CONFIG_RESCTRL_FS_PSEUDO_LOCK
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6ed0d4f5d6a3..6f4a3bd02a42 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -366,8 +366,8 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
- kfree(hw_dom->arch_mbm_total);
- kfree(hw_dom->arch_mbm_local);
+ for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++)
+ kfree(hw_dom->arch_mbm_states[i]);
kfree(hw_dom);
}
@@ -401,25 +401,26 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
*/
static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
- size_t tsize;
-
- if (resctrl_arch_is_mbm_total_enabled()) {
- tsize = sizeof(*hw_dom->arch_mbm_total);
- hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
- if (!hw_dom->arch_mbm_total)
- return -ENOMEM;
- }
- if (resctrl_arch_is_mbm_local_enabled()) {
- tsize = sizeof(*hw_dom->arch_mbm_local);
- hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
- if (!hw_dom->arch_mbm_local) {
- kfree(hw_dom->arch_mbm_total);
- hw_dom->arch_mbm_total = NULL;
- return -ENOMEM;
- }
+ size_t tsize = sizeof(struct arch_mbm_state);
+ int evt, idx;
+
+ for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
+ if (!resctrl_is_mbm_event(evt))
+ continue;
+ idx = MBM_EVENT_IDX(evt);
+ hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
+ if (!hw_dom->arch_mbm_states[idx])
+ goto cleanup;
}
return 0;
+cleanup:
+ for (idx = 0; idx < QOS_NUM_MBM_EVENTS; idx++) {
+ kfree(hw_dom->arch_mbm_states[idx]);
+ hw_dom->arch_mbm_states[idx] = NULL;
+ }
+
+ return -ENOMEM;
}
static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
@@ -864,13 +865,13 @@ static __init bool get_rdt_mon_resources(void)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
- rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
+ __set_bit(QOS_L3_OCCUP_EVENT_ID, rdt_mon_features);
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL))
- rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
+ __set_bit(QOS_L3_MBM_TOTAL_EVENT_ID, rdt_mon_features);
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
- rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
+ __set_bit(QOS_L3_MBM_LOCAL_EVENT_ID, rdt_mon_features);
- if (!rdt_mon_features)
+ if (find_first_bit(rdt_mon_features, QOS_NUM_EVENTS) == QOS_NUM_EVENTS)
return false;
return !rdt_get_mon_l3_config(r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 163174cc0d3e..06623d51d006 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -36,7 +36,7 @@ bool rdt_mon_capable;
/*
* Global to indicate which monitoring events are enabled.
*/
-unsigned int rdt_mon_features;
+DECLARE_BITMAP(rdt_mon_features, QOS_NUM_EVENTS);
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
@@ -168,19 +168,14 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
u32 rmid,
enum resctrl_event_id eventid)
{
- switch (eventid) {
- case QOS_L3_OCCUP_EVENT_ID:
+ struct arch_mbm_state *state;
+
+ if (!resctrl_is_mbm_event(eventid))
return NULL;
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &hw_dom->arch_mbm_total[rmid];
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &hw_dom->arch_mbm_local[rmid];
- }
- /* Never expect to get here */
- WARN_ON_ONCE(1);
+ state = hw_dom->arch_mbm_states[MBM_EVENT_IDX(eventid)];
- return NULL;
+ return state ? &state[rmid] : NULL;
}
void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
@@ -209,14 +204,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ int evt, idx;
+
+ for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
+ idx = MBM_EVENT_IDX(evt);
+ if (!hw_dom->arch_mbm_states[idx])
+ continue;
+ memset(hw_dom->arch_mbm_states[idx], 0,
+ sizeof(struct arch_mbm_state) * r->num_rmid);
+ }
- if (resctrl_arch_is_mbm_total_enabled())
- memset(hw_dom->arch_mbm_total, 0,
- sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
-
- if (resctrl_arch_is_mbm_local_enabled())
- memset(hw_dom->arch_mbm_local, 0,
- sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
}
static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index d56b78450a99..ce02e961a6c3 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -458,7 +458,7 @@ ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct rdtgroup *rdtgrp;
- int ret = 0;
+ int ret;
/* Valid input requires a trailing newline */
if (nbytes == 0 || buf[nbytes - 1] != '\n')
@@ -472,26 +472,15 @@ ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
}
rdt_last_cmd_clear();
- if (!strcmp(buf, "mbm_local_bytes")) {
- if (resctrl_arch_is_mbm_local_enabled())
- rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
- else
- ret = -EINVAL;
- } else if (!strcmp(buf, "mbm_total_bytes")) {
- if (resctrl_arch_is_mbm_total_enabled())
- rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
- else
- ret = -EINVAL;
- } else {
- ret = -EINVAL;
- }
-
- if (ret)
+ ret = rdt_lookup_evtid_by_name(buf);
+ if (ret < 0)
rdt_last_cmd_printf("Unsupported event id '%s'\n", buf);
+ else
+ rdtgrp->mba_mbps_event = ret;
rdtgroup_kn_unlock(of->kn);
- return ret ?: nbytes;
+ return ret < 0 ? ret : nbytes;
}
int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
@@ -502,22 +491,10 @@ int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
rdtgrp = rdtgroup_kn_lock_live(of->kn);
- if (rdtgrp) {
- switch (rdtgrp->mba_mbps_event) {
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- seq_puts(s, "mbm_local_bytes\n");
- break;
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- seq_puts(s, "mbm_total_bytes\n");
- break;
- default:
- pr_warn_once("Bad event %d\n", rdtgrp->mba_mbps_event);
- ret = -EINVAL;
- break;
- }
- } else {
+ if (rdtgrp)
+ seq_printf(s, "%s\n", rdt_event_name(rdtgrp->mba_mbps_event));
+ else
ret = -ENOENT;
- }
rdtgroup_kn_unlock(of->kn);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 3fe21dcf0fde..66e613906f3e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -347,15 +347,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ struct mbm_state *states;
- switch (evtid) {
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &d->mbm_total[idx];
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &d->mbm_local[idx];
- default:
+ if (!resctrl_is_mbm_event(evtid))
return NULL;
- }
+
+ states = d->mbm_states[MBM_EVENT_IDX(evtid)];
+
+ return states ? &states[idx] : NULL;
}
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
@@ -843,20 +842,40 @@ static void dom_data_exit(struct rdt_resource *r)
mutex_unlock(&rdtgroup_mutex);
}
-static struct mon_evt llc_occupancy_event = {
- .name = "llc_occupancy",
- .evtid = QOS_L3_OCCUP_EVENT_ID,
+static struct mon_evt all_events[QOS_NUM_EVENTS] = {
+ [QOS_L3_OCCUP_EVENT_ID] = {
+ .name = "llc_occupancy",
+ .evtid = QOS_L3_OCCUP_EVENT_ID,
+ },
+ [QOS_L3_MBM_TOTAL_EVENT_ID] = {
+ .name = "mbm_total_bytes",
+ .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
+ },
+ [QOS_L3_MBM_LOCAL_EVENT_ID] = {
+ .name = "mbm_local_bytes",
+ .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
+ },
};
-static struct mon_evt mbm_total_event = {
- .name = "mbm_total_bytes",
- .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
-};
+int rdt_lookup_evtid_by_name(char *name)
+{
+ int evt;
-static struct mon_evt mbm_local_event = {
- .name = "mbm_local_bytes",
- .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
-};
+ for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
+ if (!strcmp(name, all_events[evt].name))
+ return evt;
+ }
+
+ return -EINVAL;
+}
+
+char *rdt_event_name(enum resctrl_event_id evt)
+{
+ if (!test_bit(evt, rdt_mon_features))
+ return "unknown";
+
+ return all_events[evt].name;
+}
/*
* Initialize the event list for the resource.
@@ -870,14 +889,13 @@ static struct mon_evt mbm_local_event = {
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
+ int evt;
+
INIT_LIST_HEAD(&r->evt_list);
- if (resctrl_arch_is_llc_occupancy_enabled())
- list_add_tail(&llc_occupancy_event.list, &r->evt_list);
- if (resctrl_arch_is_mbm_total_enabled())
- list_add_tail(&mbm_total_event.list, &r->evt_list);
- if (resctrl_arch_is_mbm_local_enabled())
- list_add_tail(&mbm_local_event.list, &r->evt_list);
+ for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
+ list_add_tail(&all_events[evt].list, &r->evt_list);
+ }
}
/**
@@ -907,12 +925,12 @@ int resctrl_mon_resource_init(void)
l3_mon_evt_init(r);
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
- mbm_total_event.configurable = true;
+ all_events[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) {
- mbm_local_event.configurable = true;
+ all_events[QOS_L3_MBM_LOCAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_local_bytes_config",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 338b70c7d302..8d15d53fae76 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -125,12 +125,6 @@ static bool resctrl_is_mbm_enabled(void)
resctrl_arch_is_mbm_local_enabled());
}
-static bool resctrl_is_mbm_event(int e)
-{
- return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
- e <= QOS_L3_MBM_LOCAL_EVENT_ID);
-}
-
/*
* Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap
* of free CLOSIDs.
@@ -3970,8 +3964,10 @@ static void rdtgroup_setup_default(void)
static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
- kfree(d->mbm_total);
- kfree(d->mbm_local);
+ for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++) {
+ kfree(d->mbm_states[i]);
+ d->mbm_states[i] = NULL;
+ }
}
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
@@ -4031,32 +4027,33 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
- size_t tsize;
+ size_t tsize = sizeof(struct mbm_state);
+ int evt, idx;
if (resctrl_arch_is_llc_occupancy_enabled()) {
d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
- if (resctrl_arch_is_mbm_total_enabled()) {
- tsize = sizeof(*d->mbm_total);
- d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
- if (!d->mbm_total) {
- bitmap_free(d->rmid_busy_llc);
- return -ENOMEM;
- }
- }
- if (resctrl_arch_is_mbm_local_enabled()) {
- tsize = sizeof(*d->mbm_local);
- d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
- if (!d->mbm_local) {
- bitmap_free(d->rmid_busy_llc);
- kfree(d->mbm_total);
- return -ENOMEM;
- }
+
+ for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
+ if (!resctrl_is_mbm_event(evt))
+ continue;
+ idx = MBM_EVENT_IDX(evt);
+ d->mbm_states[idx] = kcalloc(idx_limit, tsize, GFP_KERNEL);
+ if (!d->mbm_states[idx])
+ goto cleanup;
}
return 0;
+cleanup:
+ bitmap_free(d->rmid_busy_llc);
+ for (idx = 0; idx < QOS_NUM_MBM_EVENTS; idx++) {
+ kfree(d->mbm_states[idx]);
+ d->mbm_states[idx] = NULL;
+ }
+
+ return -ENOMEM;
}
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
--
2.48.1
* Re: [PATCH v3 02/26] fs-x86/resctrl: Prepare for more monitor events
2025-04-07 23:40 ` [PATCH v3 02/26] fs-x86/resctrl: Prepare for more monitor events Tony Luck
@ 2025-04-18 21:17 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:17 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> There's a rule in computer programming that objects appear zero,
> once, or many times. So code accordingly.
>
> There are two MBM events and resctrl is coded with a lot of
>
> if (local)
> do one thing
> if (total)
> do a different thing
>
first change:
> Simplify the code by coding for many events using loops on
> which are enabled.
Please elaborate on how the primary change is the change in data
structure and that is what enables loops to be used.
second change:
>
> Make rdt_mon_features a bitmap to allow for expansion.
... and then a third change: Introduce rdt_lookup_evtid_by_name()
and rdt_event_name().
I recognize three logical changes. Could you please split this patch?
>
> Move resctrl_is_mbm_event() to <asm/resctrl.h> as it gets used by core.c
What the patch actually does is move resctrl_is_mbm_event() to
include/linux/resctrl_types.h, which is in itself unexpected
considering what resctrl_types.h is intended to be used for. See
the changelog of commit
f16adbaf9272 ("x86/resctrl: Move resctrl types to a separate header")
for details on how resctrl_types.h is intended to be used.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 6 +--
> include/linux/resctrl_types.h | 8 +++
> arch/x86/include/asm/resctrl.h | 8 +--
> arch/x86/kernel/cpu/resctrl/internal.h | 6 +--
> fs/resctrl/internal.h | 4 ++
> arch/x86/kernel/cpu/resctrl/core.c | 45 +++++++++--------
> arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++------
> fs/resctrl/ctrlmondata.c | 41 ++++-----------
> fs/resctrl/monitor.c | 70 ++++++++++++++++----------
> fs/resctrl/rdtgroup.c | 47 ++++++++---------
> 10 files changed, 133 insertions(+), 135 deletions(-)
>
...
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index a7faf2cd5406..898068a99ef7 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -55,5 +55,13 @@ enum resctrl_event_id {
> };
>
> #define QOS_NUM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID + 1)
> +#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> +#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
> +
> +static inline bool resctrl_is_mbm_event(int e)
> +{
> + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> + e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
include/linux/resctrl.h should be a better fit.
>
> #endif /* __LINUX_RESCTRL_TYPES_H */
...
>
> static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> @@ -864,13 +865,13 @@ static __init bool get_rdt_mon_resources(void)
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
> - rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> + __set_bit(QOS_L3_OCCUP_EVENT_ID, rdt_mon_features);
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL))
> - rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
> + __set_bit(QOS_L3_MBM_TOTAL_EVENT_ID, rdt_mon_features);
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
> - rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
> + __set_bit(QOS_L3_MBM_LOCAL_EVENT_ID, rdt_mon_features);
>
> - if (!rdt_mon_features)
> + if (find_first_bit(rdt_mon_features, QOS_NUM_EVENTS) == QOS_NUM_EVENTS)
> return false;
Could you please use bitmap_empty() instead? It does the same, but makes it obvious what
is being tested for.
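The two tests are indeed equivalent; a simplified userspace re-implementation (bit-at-a-time, unlike the word-at-a-time kernel helpers) shows why the bitmap_empty() spelling reads better at the call site:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Simplified models of the kernel helpers -- illustration only. */
static unsigned int find_first_bit(const unsigned long *map,
				   unsigned int nbits)
{
	for (unsigned int i = 0; i < nbits; i++) {
		if (map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
			return i;
	}
	return nbits; /* no bit set: the helper returns nbits */
}

static bool bitmap_empty(const unsigned long *map, unsigned int nbits)
{
	/* Same condition as the open-coded check, with the intent named. */
	return find_first_bit(map, nbits) == nbits;
}
```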
>
> return !rdt_get_mon_l3_config(r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 163174cc0d3e..06623d51d006 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -36,7 +36,7 @@ bool rdt_mon_capable;
> /*
> * Global to indicate which monitoring events are enabled.
> */
> -unsigned int rdt_mon_features;
> +DECLARE_BITMAP(rdt_mon_features, QOS_NUM_EVENTS);
>
> #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
>
> @@ -168,19 +168,14 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
> u32 rmid,
> enum resctrl_event_id eventid)
> {
> - switch (eventid) {
> - case QOS_L3_OCCUP_EVENT_ID:
> + struct arch_mbm_state *state;
> +
> + if (!resctrl_is_mbm_event(eventid))
> return NULL;
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &hw_dom->arch_mbm_total[rmid];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &hw_dom->arch_mbm_local[rmid];
> - }
>
> - /* Never expect to get here */
> - WARN_ON_ONCE(1);
> + state = hw_dom->arch_mbm_states[MBM_EVENT_IDX(eventid)];
>
> - return NULL;
> + return state ? &state[rmid] : NULL;
> }
>
> void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> @@ -209,14 +204,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> + int evt, idx;
> +
> + for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
> + idx = MBM_EVENT_IDX(evt);
> + if (!hw_dom->arch_mbm_states[idx])
> + continue;
This does not look safe. Missing a resctrl_is_mbm_event() check?
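To make the concern concrete, a userspace sketch (not kernel code; the id values and the assumption that MBM_EVENT_IDX() maps the total/local events to 0/1 are illustrative guesses based on the patch's array usage): iterating every enabled event and indexing arch_mbm_states[] without a resctrl_is_mbm_event() guard lets a non-MBM event id produce a bogus, here negative, index.

```c
#include <assert.h>

/*
 * Userspace sketch of the review concern: indexing the MBM state
 * array for a non-MBM event id yields an out-of-range index unless
 * resctrl_is_mbm_event() is checked first. Ids mirror the kernel's
 * enum ordering; MBM_EVENT_IDX() is an assumed definition.
 */
enum { QOS_L3_OCCUP_EVENT_ID = 1,
       QOS_L3_MBM_TOTAL_EVENT_ID = 2,
       QOS_L3_MBM_LOCAL_EVENT_ID = 3 };

#define MBM_EVENT_IDX(e)	((e) - QOS_L3_MBM_TOTAL_EVENT_ID)

static int resctrl_is_mbm_event(int evt)
{
	return evt >= QOS_L3_MBM_TOTAL_EVENT_ID &&
	       evt <= QOS_L3_MBM_LOCAL_EVENT_ID;
}

/* With the guard: a valid state index, or -1 for non-MBM events. */
int mbm_state_index(int evt)
{
	if (!resctrl_is_mbm_event(evt))
		return -1;
	return MBM_EVENT_IDX(evt);
}
```

Without the guard, MBM_EVENT_IDX(QOS_L3_OCCUP_EVENT_ID) evaluates to -1 and would index before the start of the array.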
> + memset(hw_dom->arch_mbm_states[idx], 0,
> + sizeof(struct arch_mbm_state) * r->num_rmid);
> + }
>
> - if (resctrl_arch_is_mbm_total_enabled())
> - memset(hw_dom->arch_mbm_total, 0,
> - sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
> -
> - if (resctrl_arch_is_mbm_local_enabled())
> - memset(hw_dom->arch_mbm_local, 0,
> - sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
> }
>
...
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 3fe21dcf0fde..66e613906f3e 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -347,15 +347,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> u32 rmid, enum resctrl_event_id evtid)
> {
> u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> + struct mbm_state *states;
>
> - switch (evtid) {
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &d->mbm_total[idx];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &d->mbm_local[idx];
> - default:
> + if (!resctrl_is_mbm_event(evtid))
> return NULL;
> - }
> +
> + states = d->mbm_states[MBM_EVENT_IDX(evtid)];
> +
> + return states ? &states[idx] : NULL;
> }
>
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> @@ -843,20 +842,40 @@ static void dom_data_exit(struct rdt_resource *r)
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -static struct mon_evt llc_occupancy_event = {
> - .name = "llc_occupancy",
> - .evtid = QOS_L3_OCCUP_EVENT_ID,
> +static struct mon_evt all_events[QOS_NUM_EVENTS] = {
"all_events" is very generic for a global. How about "mon_event_all"
(placing the "_all" at end to match, for example, resctrl_schema_all).
> + [QOS_L3_OCCUP_EVENT_ID] = {
> + .name = "llc_occupancy",
> + .evtid = QOS_L3_OCCUP_EVENT_ID,
> + },
> + [QOS_L3_MBM_TOTAL_EVENT_ID] = {
> + .name = "mbm_total_bytes",
> + .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
> + },
> + [QOS_L3_MBM_LOCAL_EVENT_ID] = {
> + .name = "mbm_local_bytes",
> + .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
> + },
> };
>
> -static struct mon_evt mbm_total_event = {
> - .name = "mbm_total_bytes",
> - .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
> -};
> +int rdt_lookup_evtid_by_name(char *name)
> +{
> + int evt;
Since this is resctrl fs code, please replace "rdt" with
"resctrl".
>
> -static struct mon_evt mbm_local_event = {
> - .name = "mbm_local_bytes",
> - .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
> -};
> + for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
> + if (!strcmp(name, all_events[evt].name))
> + return evt;
> + }
* This is resctrl fs code. rdt_mon_features should be private to
x86 so resctrl fs should not be accessing it directly. Perhaps
there can be a new arch helper that can be used by resctrl to
query if event is enabled? Similar to resctrl_arch_is_llc_occupancy_enabled()
and friends but where the event ID is a parameter and arch code can
use rdt_mon_features.
* While the function name is "lookup_evtid_by_name" this function
does not just look up the event id by name but also ensures that the
event is enabled. The caller then uses "lookup_evtid_by_name" as
a proxy for "is this event enabled". I think the code will be easier
to understand if the functions do not have such hidden "features".
resctrl fs could first use new arch helper to determine if
event is enabled and then use a fs helper that reads name
directly from the event array.
* For a function returning event id the return type is expected to
be enum resctrl_event_id.
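One possible shape of that split, as a userspace sketch (all names are hypothetical, not an existing kernel API): the arch side owns the enablement bitmap, the fs side does a pure name-to-id lookup with no hidden enablement check, and the unused id 0 stands in for "not found".

```c
#include <assert.h>
#include <string.h>

/*
 * Userspace sketch of the suggested split (all names hypothetical):
 * one arch helper that only answers "is this event enabled?", and
 * one fs helper that only maps a name to an event id.
 */
enum resctrl_event_id { QOS_L3_OCCUP_EVENT_ID = 1,
			QOS_L3_MBM_TOTAL_EVENT_ID = 2,
			QOS_L3_MBM_LOCAL_EVENT_ID = 3,
			QOS_NUM_EVENTS };

static const char *const event_names[QOS_NUM_EVENTS] = {
	[QOS_L3_OCCUP_EVENT_ID]		= "llc_occupancy",
	[QOS_L3_MBM_TOTAL_EVENT_ID]	= "mbm_total_bytes",
	[QOS_L3_MBM_LOCAL_EVENT_ID]	= "mbm_local_bytes",
};

/* Arch side: the only code that looks at the (mocked) feature bitmap. */
static unsigned long mock_mon_features = 1UL << QOS_L3_MBM_TOTAL_EVENT_ID;

int resctrl_arch_event_enabled(enum resctrl_event_id evt)
{
	return !!(mock_mon_features & (1UL << evt));
}

/* Fs side: pure name lookup; 0 (an unused id) stands in for "not found". */
enum resctrl_event_id resctrl_event_id_by_name(const char *name)
{
	for (int evt = QOS_L3_OCCUP_EVENT_ID; evt < QOS_NUM_EVENTS; evt++)
		if (event_names[evt] && !strcmp(name, event_names[evt]))
			return (enum resctrl_event_id)evt;
	return (enum resctrl_event_id)0;
}
```

A caller that needs both answers composes the two helpers explicitly instead of relying on one function's hidden side condition.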
> +
> + return -EINVAL;
> +}
> +
> +char *rdt_event_name(enum resctrl_event_id evt)
> +{
> + if (!test_bit(evt, rdt_mon_features))
> + return "unknown";
> +
> + return all_events[evt].name;
> +}
Same comments as rdt_lookup_evtid_by_name()
Also please watch for rdt_mon_features in rest of resctrl fs changes.
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 03/26] fs/resctrl: Change how events are initialized
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
2025-04-07 23:40 ` [PATCH v3 01/26] fs/resctrl: Simplify allocation of mon_data structures Tony Luck
2025-04-07 23:40 ` [PATCH v3 02/26] fs-x86/resctrl: Prepare for more monitor events Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:22 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 04/26] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
` (24 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
New monitor events break some assumptions:
1) New events can be in resources other than L3.
2) Enumeration of events may not be complete during early
boot.
Prepare for events in other resources.
Delay building the event lists until first mount of the resctrl
file system.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 3 +++
fs/resctrl/monitor.c | 30 +++++++++++++++++++-----------
fs/resctrl/rdtgroup.c | 2 ++
3 files changed, 24 insertions(+), 11 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 7a65ea02d442..08dbf89939ac 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -76,6 +76,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
*/
struct mon_evt {
enum resctrl_event_id evtid;
+ enum resctrl_res_level rid;
char *name;
bool configurable;
struct list_head list;
@@ -390,6 +391,8 @@ int rdt_lookup_evtid_by_name(char *name);
char *rdt_event_name(enum resctrl_event_id evt);
+void resctrl_init_mon_events(void);
+
#ifdef CONFIG_RESCTRL_FS_PSEUDO_LOCK
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 66e613906f3e..472754d082cb 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -846,14 +846,17 @@ static struct mon_evt all_events[QOS_NUM_EVENTS] = {
[QOS_L3_OCCUP_EVENT_ID] = {
.name = "llc_occupancy",
.evtid = QOS_L3_OCCUP_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
},
[QOS_L3_MBM_TOTAL_EVENT_ID] = {
.name = "mbm_total_bytes",
.evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
},
[QOS_L3_MBM_LOCAL_EVENT_ID] = {
.name = "mbm_local_bytes",
.evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
},
};
@@ -878,22 +881,29 @@ char *rdt_event_name(enum resctrl_event_id evt)
}
/*
- * Initialize the event list for the resource.
+ * Initialize the event list for all mon_capable resources.
*
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
- *
- * mon_put_default_kn_priv_all() also assumes monitor events are only supported
- * on the L3 resource.
+ * Called on each mount of the resctrl file system when all
+ * events have been enumerated. Only needs to build the per-resource
+ * event lists once.
*/
-static void l3_mon_evt_init(struct rdt_resource *r)
+void resctrl_init_mon_events(void)
{
+ struct rdt_resource *r;
+ static bool only_once;
int evt;
- INIT_LIST_HEAD(&r->evt_list);
+ if (only_once)
+ return;
+ only_once = true;
+
+ for_each_mon_capable_rdt_resource(r)
+ INIT_LIST_HEAD(&r->evt_list);
for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
+ r = resctrl_arch_get_resource(all_events[evt].rid);
+ if (!r->mon_capable)
+ continue;
list_add_tail(&all_events[evt].list, &r->evt_list);
}
}
@@ -922,8 +932,6 @@ int resctrl_mon_resource_init(void)
if (ret)
return ret;
- l3_mon_evt_init(r);
-
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
all_events[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 8d15d53fae76..1433fc098a90 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2574,6 +2574,8 @@ static int rdt_get_tree(struct fs_context *fc)
goto out;
}
+ resctrl_init_mon_events();
+
ret = rdtgroup_setup_root(ctx);
if (ret)
goto out;
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v3 03/26] fs/resctrl: Change how events are initialized
2025-04-07 23:40 ` [PATCH v3 03/26] fs/resctrl: Change how events are initialized Tony Luck
@ 2025-04-18 21:22 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:22 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> New monitor events break some assumptions:
This changelog jumps into a problem without any context.
Please follow changelog guidance from maintainer-tip.rst.
Specifically, "A good structure is to explain the context,
the problem and the solution in separate paragraphs and this
order."
>
> 1) New events can be in resources other than L3.
> 2) Enumeration of events may not be complete during early
> boot.
>
> Prepare for events in other resources.
Please include what this preparation involves.
>
> Delay building the event lists until first mount of the resctrl
> file system.
Please include in context what is meant by "event lists".
But ... the previous patch reminded the reader about all the event state
that is allocated during domain online, which usually happens
*before* mount of resctrl. This work thus goes from "use enumeration
of events during boot to allocate necessary event state" in one patch to
"enumeration of events are not complete during boot so build event lists on
resctrl mount" in the next patch. This is a big contradiction to
me.
I think it is clear that not all events can be treated equally but this
implementation pretends to treat them equally when convenient (this patch)
and relies on code flow assumptions (previous patch that only allocated state
for L3 events during domain online) for things to "work out" in the end.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> fs/resctrl/internal.h | 3 +++
> fs/resctrl/monitor.c | 30 +++++++++++++++++++-----------
> fs/resctrl/rdtgroup.c | 2 ++
> 3 files changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 7a65ea02d442..08dbf89939ac 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -76,6 +76,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> */
> struct mon_evt {
> enum resctrl_event_id evtid;
> + enum resctrl_res_level rid;
This structure has some kernel-doc that is not visible in this hunk but
also needs a change when adding a new member.
> char *name;
> bool configurable;
> struct list_head list;
> @@ -390,6 +391,8 @@ int rdt_lookup_evtid_by_name(char *name);
>
> char *rdt_event_name(enum resctrl_event_id evt);
>
> +void resctrl_init_mon_events(void);
> +
> #ifdef CONFIG_RESCTRL_FS_PSEUDO_LOCK
> int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
>
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 66e613906f3e..472754d082cb 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -846,14 +846,17 @@ static struct mon_evt all_events[QOS_NUM_EVENTS] = {
> [QOS_L3_OCCUP_EVENT_ID] = {
> .name = "llc_occupancy",
> .evtid = QOS_L3_OCCUP_EVENT_ID,
> + .rid = RDT_RESOURCE_L3,
> },
> [QOS_L3_MBM_TOTAL_EVENT_ID] = {
> .name = "mbm_total_bytes",
> .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
> + .rid = RDT_RESOURCE_L3,
> },
> [QOS_L3_MBM_LOCAL_EVENT_ID] = {
> .name = "mbm_local_bytes",
> .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
> + .rid = RDT_RESOURCE_L3,
> },
> };
>
> @@ -878,22 +881,29 @@ char *rdt_event_name(enum resctrl_event_id evt)
> }
>
> /*
> - * Initialize the event list for the resource.
> + * Initialize the event list for all mon_capable resources.
> *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> - * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
> - *
> - * mon_put_default_kn_priv_all() also assumes monitor events are only supported
> - * on the L3 resource.
> + * Called on each mount of the resctrl file system when all
> + * events have been enumerated. Only needs to build the per-resource
> + * event lists once.
> */
> -static void l3_mon_evt_init(struct rdt_resource *r)
> +void resctrl_init_mon_events(void)
> {
> + struct rdt_resource *r;
> + static bool only_once;
> int evt;
>
> - INIT_LIST_HEAD(&r->evt_list);
> + if (only_once)
> + return;
> + only_once = true;
> +
> + for_each_mon_capable_rdt_resource(r)
> + INIT_LIST_HEAD(&r->evt_list);
>
> for_each_set_bit(evt, rdt_mon_features, QOS_NUM_EVENTS) {
This is fs code so this needs to be done without peeking into
rdt_mon_features.
> + r = resctrl_arch_get_resource(all_events[evt].rid);
> + if (!r->mon_capable)
> + continue;
> list_add_tail(&all_events[evt].list, &r->evt_list);
> }
> }
> @@ -922,8 +932,6 @@ int resctrl_mon_resource_init(void)
> if (ret)
> return ret;
>
> - l3_mon_evt_init(r);
> -
> if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> all_events[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
> resctrl_file_fflags_init("mbm_total_bytes_config",
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 8d15d53fae76..1433fc098a90 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -2574,6 +2574,8 @@ static int rdt_get_tree(struct fs_context *fc)
> goto out;
> }
>
> + resctrl_init_mon_events();
> +
> ret = rdtgroup_setup_root(ctx);
> if (ret)
> goto out;
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 04/26] fs/resctrl: Set up Kconfig options for telemetry events
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (2 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 03/26] fs/resctrl: Change how events are initialized Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:23 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 05/26] x86/rectrl: Fake OOBMSM interface Tony Luck
` (23 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Enumeration support is provided by the Intel PMT subsystem.
arch/x86 selects this option based on:
X86_64: Counter registers are in MMIO space. There is no readq()
function on 32-bit. Emulation is possible with readl(), but there
are races. Running 32-bit kernels on systems that support this
feature seems pointless.
CPU_SUP_INTEL: It is an Intel specific feature.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/Kconfig | 1 +
drivers/platform/x86/intel/pmt/Kconfig | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ea29d22a621f..44a195ee7a42 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -504,6 +504,7 @@ config X86_CPU_RESCTRL
bool "x86 CPU resource control support"
depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
depends on MISC_FILESYSTEMS
+ select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL)
select ARCH_HAS_CPU_RESCTRL
select RESCTRL_FS
select RESCTRL_FS_PSEUDO_LOCK
diff --git a/drivers/platform/x86/intel/pmt/Kconfig b/drivers/platform/x86/intel/pmt/Kconfig
index e916fc966221..b282910b49ef 100644
--- a/drivers/platform/x86/intel/pmt/Kconfig
+++ b/drivers/platform/x86/intel/pmt/Kconfig
@@ -38,3 +38,9 @@ config INTEL_PMT_CRASHLOG
To compile this driver as a module, choose M here: the module
will be called intel_pmt_crashlog.
+
+config INTEL_AET_RESCTRL
+ bool
+ help
+ Architecture config should "select" this option to enable
+ support for RMID telemtry events in the resctrl file system.
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 04/26] fs/resctrl: Set up Kconfig options for telemetry events
2025-04-07 23:40 ` [PATCH v3 04/26] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
@ 2025-04-18 21:23 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:23 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Enumeration support is provided by the Intel PMT subsystem.
Needs context before jumping in with solution.
>
> arch/x86 selects this option based on:
>
> X86_64: Counter registers are in MMIO space. There is no readq()
> function on 32-bit. Emulation is possible with readl(), but there
> are races. Running 32-bit kernels on systems that support this
> feature seems pointless.
>
> CPU_SUP_INTEL: It is an Intel specific feature.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/Kconfig | 1 +
> drivers/platform/x86/intel/pmt/Kconfig | 6 ++++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index ea29d22a621f..44a195ee7a42 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -504,6 +504,7 @@ config X86_CPU_RESCTRL
> bool "x86 CPU resource control support"
> depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> depends on MISC_FILESYSTEMS
> + select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL)
> select ARCH_HAS_CPU_RESCTRL
> select RESCTRL_FS
> select RESCTRL_FS_PSEUDO_LOCK
> diff --git a/drivers/platform/x86/intel/pmt/Kconfig b/drivers/platform/x86/intel/pmt/Kconfig
> index e916fc966221..b282910b49ef 100644
> --- a/drivers/platform/x86/intel/pmt/Kconfig
> +++ b/drivers/platform/x86/intel/pmt/Kconfig
> @@ -38,3 +38,9 @@ config INTEL_PMT_CRASHLOG
>
> To compile this driver as a module, choose M here: the module
> will be called intel_pmt_crashlog.
> +
> +config INTEL_AET_RESCTRL
> + bool
Usually these settings come with some "depends on" ... no dependencies needed?
> + help
> + Architecture config should "select" this option to enable
> + support for RMID telemtry events in the resctrl file system.
telemtry -> telemetry
Please define what is meant by "RMID telemetry events" before this statement.
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 05/26] x86/rectrl: Fake OOBMSM interface
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (3 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 04/26] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:27 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 06/26] fs-x86/rectrl: Improve domain type checking Tony Luck
` (22 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Real version is coming soon ... this is here so the remaining parts
will build (and run ... assuming a 2 socket system that supports RDT
monitoring ... only missing part is that the event counters just
report fixed values).
Just for ease of testing and RFC discussion.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
.../cpu/resctrl/fake_intel_aet_features.h | 73 ++++++++++++++++
.../cpu/resctrl/fake_intel_aet_features.c | 87 +++++++++++++++++++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
3 files changed, 161 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
new file mode 100644
index 000000000000..c835c4108abc
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Bits stolen from OOBMSM VSEC discovery code */
+
+enum pmt_feature_id {
+ FEATURE_INVALID = 0x0,
+ FEATURE_PER_CORE_PERF_TELEM = 0x1,
+ FEATURE_PER_CORE_ENV_TELEM = 0x2,
+ FEATURE_PER_RMID_PERF_TELEM = 0x3,
+ FEATURE_ACCEL_TELEM = 0x4,
+ FEATURE_UNCORE_TELEM = 0x5,
+ FEATURE_CRASH_LOG = 0x6,
+ FEATURE_PETE_LOG = 0x7,
+ FEATURE_TPMI_CTRL = 0x8,
+ FEATURE_RESERVED = 0x9,
+ FEATURE_TRACING = 0xA,
+ FEATURE_PER_RMID_ENERGY_TELEM = 0xB,
+ FEATURE_MAX = 0xB,
+};
+
+/**
+ * struct oobmsm_plat_info - Platform information for a device instance
+ * @cdie_mask: Mask of all compute dies in the partition
+ * @package_id: CPU Package id
+ * @partition: Package partition id when multiple VSEC PCI devices per package
+ * @segment: PCI segment ID
+ * @bus_number: PCI bus number
+ * @device_number: PCI device number
+ * @function_number: PCI function number
+ *
+ * Structure to store platform data for a OOBMSM device instance.
+ */
+struct oobmsm_plat_info {
+ u16 cdie_mask;
+ u8 package_id;
+ u8 partition;
+ u8 segment;
+ u8 bus_number;
+ u8 device_number;
+ u8 function_number;
+};
+
+enum oobmsm_supplier_type {
+ OOBMSM_SUP_PLAT_INFO,
+ OOBMSM_SUP_DISC_INFO,
+ OOBMSM_SUP_S3M_SIMICS,
+ OOBMSM_SUP_TYPE_MAX
+};
+
+struct oobmsm_mapping_supplier {
+ struct device *supplier_dev[OOBMSM_SUP_TYPE_MAX];
+ struct oobmsm_plat_info plat_info;
+ unsigned long features;
+};
+
+struct telemetry_region {
+ struct oobmsm_plat_info plat_info;
+ void __iomem *addr;
+ size_t size;
+ u32 guid;
+ u32 num_rmids;
+};
+
+struct pmt_feature_group {
+ enum pmt_feature_id id;
+ int count;
+ struct kref kref;
+ struct telemetry_region regions[];
+};
+
+struct pmt_feature_group *intel_pmt_get_regions_by_feature(enum pmt_feature_id id);
+
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group);
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
new file mode 100644
index 000000000000..5a16db67c7b8
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/cleanup.h>
+#include <linux/minmax.h>
+#include <linux/slab.h>
+#include "fake_intel_aet_features.h"
+#include <linux/intel_vsec.h>
+#include <linux/resctrl.h>
+
+#include "internal.h"
+
+/* Amount of memory for each fake MMIO space */
+#define ENERGY_QWORDS ((576 * 2) + 3)
+#define ENERGY_SIZE (ENERGY_QWORDS * 8)
+#define PERF_QWORDS ((576 * 7) + 3)
+#define PERF_SIZE (PERF_QWORDS * 8)
+
+static long pg[4 * ENERGY_QWORDS + 2 * PERF_QWORDS];
+
+/*
+ * Fill the fake MMIO space with all different values,
+ * all with BIT(63) set to indicate valid entries.
+ */
+static int __init fill(void)
+{
+ u64 val = 0;
+
+ for (int i = 0; i < sizeof(pg); i += sizeof(val)) {
+ pg[i / sizeof(val)] = BIT_ULL(63) + val;
+ val++;
+ }
+ return 0;
+}
+device_initcall(fill);
+
+#define PKG_REGION(_entry, _guid, _addr, _size, _pkg, _num_rmids) \
+ [_entry] = { .guid = _guid, .addr = (void __iomem *)_addr, \
+ .num_rmids = _num_rmids, \
+ .size = _size, .plat_info = { .package_id = _pkg }}
+
+/*
+ * Set up a fake return for call to:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
+ * Pretend there are two aggregators on each of the sockets to test
+ * the code that sums over multiple aggregators.
+ */
+static struct pmt_feature_group fake_energy = {
+ .count = 4,
+ .regions = {
+ PKG_REGION(0, 0x26696143, &pg[0 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
+ PKG_REGION(1, 0x26696143, &pg[1 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
+ PKG_REGION(2, 0x26696143, &pg[2 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64),
+ PKG_REGION(3, 0x26696143, &pg[3 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64)
+ }
+};
+
+/*
+ * Fake return for:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
+ */
+static struct pmt_feature_group fake_perf = {
+ .count = 2,
+ .regions = {
+ PKG_REGION(0, 0x26557651, &pg[4 * ENERGY_QWORDS + 0 * PERF_QWORDS], PERF_SIZE, 0, 576),
+ PKG_REGION(1, 0x26557651, &pg[4 * ENERGY_QWORDS + 1 * PERF_QWORDS], PERF_SIZE, 1, 576)
+ }
+};
+
+struct pmt_feature_group *
+intel_pmt_get_regions_by_feature(enum pmt_feature_id id)
+{
+ switch (id) {
+ case FEATURE_PER_RMID_ENERGY_TELEM:
+ return &fake_energy;
+ case FEATURE_PER_RMID_PERF_TELEM:
+ return &fake_perf;
+ default:
+ return ERR_PTR(-ENOENT);
+ }
+ return ERR_PTR(-ENOENT);
+}
+
+/*
+ * Nothing needed for the "put" function.
+ */
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group)
+{
+}
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index 909be78ec6da..c56d3acf8ac7 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_INTEL_AET_RESCTRL) += fake_intel_aet_features.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
# To allow define_trace.h's recursive include:
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 05/26] x86/rectrl: Fake OOBMSM interface
2025-04-07 23:40 ` [PATCH v3 05/26] x86/rectrl: Fake OOBMSM interface Tony Luck
@ 2025-04-18 21:27 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:27 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony:
(deja vu ... let's try this comment again ... x86/rectrl -> x86/resctrl)
On 4/7/25 4:40 PM, Tony Luck wrote:
...
> diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
> new file mode 100644
> index 000000000000..5a16db67c7b8
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
> @@ -0,0 +1,87 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/cleanup.h>
> +#include <linux/minmax.h>
> +#include <linux/slab.h>
> +#include "fake_intel_aet_features.h"
> +#include <linux/intel_vsec.h>
> +#include <linux/resctrl.h>
> +
> +#include "internal.h"
> +
> +/* Amount of memory for each fake MMIO space */
> +#define ENERGY_QWORDS ((576 * 2) + 3)
> +#define ENERGY_SIZE (ENERGY_QWORDS * 8)
> +#define PERF_QWORDS ((576 * 7) + 3)
> +#define PERF_SIZE (PERF_QWORDS * 8)
Could you please add explanations for the magic numbers?
For example, why are both energy and perf using 576?
> +
> +static long pg[4 * ENERGY_QWORDS + 2 * PERF_QWORDS];
> +
> +/*
> + * Fill the fake MMIO space with all different values,
> + * all with BIT(63) set to indicate valid entries.
> + */
> +static int __init fill(void)
> +{
> + u64 val = 0;
> +
> + for (int i = 0; i < sizeof(pg); i += sizeof(val)) {
> + pg[i / sizeof(val)] = BIT_ULL(63) + val;
> + val++;
> + }
> + return 0;
> +}
> +device_initcall(fill);
> +
> +#define PKG_REGION(_entry, _guid, _addr, _size, _pkg, _num_rmids) \
> + [_entry] = { .guid = _guid, .addr = (void __iomem *)_addr, \
> + .num_rmids = _num_rmids, \
> + .size = _size, .plat_info = { .package_id = _pkg }}
> +
> +/*
> + * Set up a fake return for call to:
> + * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
> + * Pretend there are two aggregators on each of the sockets to test
> + * the code that sums over multiple aggregators.
> + */
> +static struct pmt_feature_group fake_energy = {
> + .count = 4,
> + .regions = {
> + PKG_REGION(0, 0x26696143, &pg[0 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
> + PKG_REGION(1, 0x26696143, &pg[1 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
> + PKG_REGION(2, 0x26696143, &pg[2 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64),
> + PKG_REGION(3, 0x26696143, &pg[3 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64)
> + }
> +};
> +
> +/*
> + * Fake return for:
> + * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
> + */
> +static struct pmt_feature_group fake_perf = {
> + .count = 2,
> + .regions = {
> + PKG_REGION(0, 0x26557651, &pg[4 * ENERGY_QWORDS + 0 * PERF_QWORDS], PERF_SIZE, 0, 576),
> + PKG_REGION(1, 0x26557651, &pg[4 * ENERGY_QWORDS + 1 * PERF_QWORDS], PERF_SIZE, 1, 576)
> + }
> +};
> +
> +struct pmt_feature_group *
> +intel_pmt_get_regions_by_feature(enum pmt_feature_id id)
> +{
> + switch (id) {
> + case FEATURE_PER_RMID_ENERGY_TELEM:
> + return &fake_energy;
> + case FEATURE_PER_RMID_PERF_TELEM:
> + return &fake_perf;
> + default:
> + return ERR_PTR(-ENOENT);
> + }
> + return ERR_PTR(-ENOENT);
Not reachable.
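For what it's worth, the cleanup is just dropping the trailing return. A userspace sketch (simplified mocks, plain NULL instead of ERR_PTR) showing the switch already covers every path:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace sketch of the cleanup: every case of the switch returns,
 * so the return statement after the switch is dead code and can be
 * dropped. Feature ids and error encoding are simplified mocks.
 */
enum pmt_feature_id { FEATURE_PER_RMID_PERF_TELEM = 0x3,
		      FEATURE_PER_RMID_ENERGY_TELEM = 0xB };

static int fake_energy = 1, fake_perf = 2;

/* Returns the matching fake group, or NULL (standing in for ERR_PTR). */
int *get_regions_by_feature(enum pmt_feature_id id)
{
	switch (id) {
	case FEATURE_PER_RMID_ENERGY_TELEM:
		return &fake_energy;
	case FEATURE_PER_RMID_PERF_TELEM:
		return &fake_perf;
	default:
		return NULL;	/* no unreachable return after the switch */
	}
}
```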
> +}
> +
> +/*
> + * Nothing needed for the "put" function.
> + */
> +void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group)
> +{
> +}
> diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
> index 909be78ec6da..c56d3acf8ac7 100644
> --- a/arch/x86/kernel/cpu/resctrl/Makefile
> +++ b/arch/x86/kernel/cpu/resctrl/Makefile
> @@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
> obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
> +obj-$(CONFIG_INTEL_AET_RESCTRL) += fake_intel_aet_features.o
> obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
>
> # To allow define_trace.h's recursive include:
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 06/26] fs-x86/rectrl: Improve domain type checking
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (4 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 05/26] x86/rectrl: Fake OOBMSM interface Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:40 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
` (21 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
The rdt_domain_hdr structure is used in both control and monitor
domain structures to provide common methods for operations such as
adding a CPU to a domain, removing a CPU from a domain, and accessing
the mask of all CPUs in a domain.
The "type" field provides a simple check whether a domain is a
control or monitor domain so that programming errors operating
on domains will be quickly caught.
To prepare for additional domain types that depend on the rdt_resource
to which they are connected, add the resource id into the type.
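The packing can be seen in a small userspace sketch that mirrors the patch's macros (GENMASK(8, 0) and BIT() written out in plain C), confirming that a single compare checks both the resource id and the domain kind:

```c
#include <assert.h>

/*
 * Userspace sketch mirroring the DOMTYPE packing from the patch:
 * the low 9 bits carry the resource id, and separate bits flag
 * control vs monitor domains, so a domain-type sanity check stays
 * a single comparison.
 */
#define DOMTYPE_RID	0x1ffu			/* GENMASK(8, 0) */
#define DOMTYPE_CTRL	(1u << 9)		/* BIT(9) */
#define DOMTYPE_MON	(1u << 10)		/* BIT(10) */
#define DOMTYPE(rid, type)	(((rid) & DOMTYPE_RID) | (type))

/* The WARN_ON_ONCE() checks in the patch reduce to this comparison. */
int domtype_matches(unsigned int hdr_type, unsigned int rid, unsigned int kind)
{
	return hdr_type == DOMTYPE(rid, kind);
}
```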
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 19 ++++++++++---------
arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++------
fs/resctrl/ctrlmondata.c | 2 +-
3 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d6a926b6fc0e..177f9879bae1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -111,11 +111,6 @@ struct resctrl_staged_config {
bool have_new_ctrl;
};
-enum resctrl_domain_type {
- RESCTRL_CTRL_DOMAIN,
- RESCTRL_MON_DOMAIN,
-};
-
/**
* struct rdt_domain_hdr - common header for different domain types
* @list: all instances of this resource
@@ -124,12 +119,18 @@ enum resctrl_domain_type {
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
- struct list_head list;
- int id;
- enum resctrl_domain_type type;
- struct cpumask cpu_mask;
+ struct list_head list;
+ int id;
+ u32 type;
+ struct cpumask cpu_mask;
};
+/* Bitfields in rdt_domain_hdr.type */
+#define DOMTYPE_RID GENMASK(8, 0)
+#define DOMTYPE_CTRL BIT(9)
+#define DOMTYPE_MON BIT(10)
+#define DOMTYPE(rid, type) (((rid) & DOMTYPE_RID) | (type))
+
/**
* struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
* @hdr: common header for different domain types
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6f4a3bd02a42..d82a4a2db699 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -457,7 +457,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_CTRL)))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -473,7 +473,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
- d->hdr.type = RESCTRL_CTRL_DOMAIN;
+ d->hdr.type = DOMTYPE(r->rid, DOMTYPE_CTRL);
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
rdt_domain_reconfigure_cdp(r);
@@ -512,7 +512,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
@@ -526,7 +526,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
- d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.type = DOMTYPE(r->rid, DOMTYPE_MON);
d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!d->ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -582,7 +582,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_CTRL)))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -628,7 +628,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index ce02e961a6c3..0c245af0ff42 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -614,7 +614,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
* the resource to find the domain with "domid".
*/
hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+ if (!hdr || WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON))) {
ret = -ENOENT;
goto out;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 06/26] fs-x86/rectrl: Improve domain type checking
2025-04-07 23:40 ` [PATCH v3 06/26] fs-x86/rectrl: Improve domain type checking Tony Luck
@ 2025-04-18 21:40 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:40 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> The rdt_domain_hdr structure is used in both control and monitor
> domain structures to provide common methods for operations such as
> adding a CPU to a domain, removing a CPU from a domain, and accessing
> the mask of all CPUs in a domain.
>
> The "type" field provides a simple check of whether a domain is a
> control or monitor domain so that programming errors when operating
> on domains are quickly caught.
>
> To prepare for additional domain types that depend on the rdt_resource
> to which they are connected, add the resource id into the type field.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 19 ++++++++++---------
> arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++------
> fs/resctrl/ctrlmondata.c | 2 +-
> 3 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index d6a926b6fc0e..177f9879bae1 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -111,11 +111,6 @@ struct resctrl_staged_config {
> bool have_new_ctrl;
> };
>
> -enum resctrl_domain_type {
> - RESCTRL_CTRL_DOMAIN,
> - RESCTRL_MON_DOMAIN,
> -};
> -
> /**
> * struct rdt_domain_hdr - common header for different domain types
> * @list: all instances of this resource
> @@ -124,12 +119,18 @@ enum resctrl_domain_type {
> * @cpu_mask: which CPUs share this resource
> */
> struct rdt_domain_hdr {
> - struct list_head list;
> - int id;
> - enum resctrl_domain_type type;
> - struct cpumask cpu_mask;
> + struct list_head list;
> + int id;
> + u32 type;
> + struct cpumask cpu_mask;
> };
>
> +/* Bitfields in rdt_domain_hdr.type */
> +#define DOMTYPE_RID GENMASK(8, 0)
> +#define DOMTYPE_CTRL BIT(9)
> +#define DOMTYPE_MON BIT(10)
> +#define DOMTYPE(rid, type) (((rid) & DOMTYPE_RID) | (type))
> +
This seems unnecessarily complicated to me. Why not just add the
resource id to the domain header?
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (5 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 06/26] fs-x86/rectrl: Improve domain type checking Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:51 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 08/26] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
` (20 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
To prepare for additional types of monitoring domains, move all the
L3 specific initialization into a helper function.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 57 ++++++++++++++++++------------
1 file changed, 35 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d82a4a2db699..703423b0be0e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -493,33 +493,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
}
}
-static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct list_head *add_pos = NULL;
struct rdt_hw_mon_domain *hw_dom;
- struct rdt_domain_hdr *hdr;
struct rdt_mon_domain *d;
int err;
- lockdep_assert_held(&domain_list_lock);
-
- if (id < 0) {
- pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->mon_scope, r->name);
- return;
- }
-
- hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
- if (hdr) {
- if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
-
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- return;
- }
-
hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
if (!hw_dom)
return;
@@ -552,6 +531,40 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
}
}
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
+ return;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ return;
+ }
+
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ setup_l3_mon_domain(cpu, id, r, add_pos);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ }
+}
+
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
if (r->alloc_capable)
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
2025-04-07 23:40 ` [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
@ 2025-04-18 21:51 ` Reinette Chatre
2025-04-21 20:01 ` Luck, Tony
0 siblings, 1 reply; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:51 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> To prepare for additional types of monitoring domains, move all the
> L3 specific initialization into a helper function.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 57 ++++++++++++++++++------------
> 1 file changed, 35 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index d82a4a2db699..703423b0be0e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -493,33 +493,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> }
> }
>
> -static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
> {
> - int id = get_domain_id_from_scope(cpu, r->mon_scope);
> - struct list_head *add_pos = NULL;
> struct rdt_hw_mon_domain *hw_dom;
> - struct rdt_domain_hdr *hdr;
> struct rdt_mon_domain *d;
> int err;
>
> - lockdep_assert_held(&domain_list_lock);
> -
> - if (id < 0) {
> - pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> - cpu, r->mon_scope, r->name);
> - return;
> - }
> -
> - hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
> - if (hdr) {
> - if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
> - return;
> - d = container_of(hdr, struct rdt_mon_domain, hdr);
> -
> - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> - return;
> - }
> -
Many functions called by this new "setup_l3_mon_domain()" are specific to the L3 resource, but
since the L3 resource has so far been the only one supported, the naming has been generic. Now that
this function is made resource specific, I think it will help make the code clearer if the
L3 specific functions called by it are also renamed. For example, mon_domain_free() can
be renamed to free_l3_mon_domain() to match the "setup_l3_mon_domain()" introduced here. Also
arch_mon_domain_online() -> arch_l3_mon_domain_online().
Seems like resctrl_online_mon_domain() is only temporarily specific to L3 in this series
(would be helpful if such details are explained in changelog) ... to start it could
include a check that ensures it is only called on L3 resource and then it will help
to clarify what this patch does and how following work builds on it.
> hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> if (!hw_dom)
> return;
> @@ -552,6 +531,40 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> }
> }
>
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
2025-04-18 21:51 ` Reinette Chatre
@ 2025-04-21 20:01 ` Luck, Tony
2025-04-22 18:18 ` Reinette Chatre
0 siblings, 1 reply; 67+ messages in thread
From: Luck, Tony @ 2025-04-21 20:01 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
On Fri, Apr 18, 2025 at 02:51:01PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 4/7/25 4:40 PM, Tony Luck wrote:
> > To prepare for additional types of monitoring domains, move all the
> > L3 specific initialization into a helper function.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > arch/x86/kernel/cpu/resctrl/core.c | 57 ++++++++++++++++++------------
> > 1 file changed, 35 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index d82a4a2db699..703423b0be0e 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -493,33 +493,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> > }
> > }
> >
> > -static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> > +static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
> > {
> > - int id = get_domain_id_from_scope(cpu, r->mon_scope);
> > - struct list_head *add_pos = NULL;
> > struct rdt_hw_mon_domain *hw_dom;
> > - struct rdt_domain_hdr *hdr;
> > struct rdt_mon_domain *d;
> > int err;
> >
> > - lockdep_assert_held(&domain_list_lock);
> > -
> > - if (id < 0) {
> > - pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
> > - cpu, r->mon_scope, r->name);
> > - return;
> > - }
> > -
> > - hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
> > - if (hdr) {
> > - if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
> > - return;
> > - d = container_of(hdr, struct rdt_mon_domain, hdr);
> > -
> > - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> > - return;
> > - }
> > -
>
> Many functions called by this new "setup_l3_mon_domain()" are specific to L3 resource but
> since L3 resource has so far been the only one supported the naming has been generic. Now that
> this function is made resource specific I think it will help make the code clear if the
> L3 specific functions called by it are also renamed. For example, mon_domain_free() can
> be renamed to free_l3_mon_domain() to match the "setup_l3_mon_domain()" introduced here. Also
> arch_mon_domain_online() -> arch_l3_mon_domain_online().
What about "struct rdt_mon_domain"? It is now specific to L3. Should I
change that to "rdt_l3_mon_domain" as well (60 lines affected)?
Ditto for rdt_hw_mon_domain (but only 12 lines for this one).
> Seems like resctrl_online_mon_domain() is only temporarily specific to L3 in this series
> (would be helpful if such details are explained in changelog) ... to start it could
> include a check that ensures it is only called on L3 resource and then it will help
> to clarify what this patch does and how following work builds on it.
>
> > hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> > if (!hw_dom)
> > return;
> > @@ -552,6 +531,40 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> > }
> > }
> >
> Reinette
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
2025-04-21 20:01 ` Luck, Tony
@ 2025-04-22 18:18 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-22 18:18 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Tony,
On 4/21/25 1:01 PM, Luck, Tony wrote:
> On Fri, Apr 18, 2025 at 02:51:01PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 4/7/25 4:40 PM, Tony Luck wrote:
>>> To prepare for additional types of monitoring domains, move all the
>>> L3 specific initialization into a helper function.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>> arch/x86/kernel/cpu/resctrl/core.c | 57 ++++++++++++++++++------------
>>> 1 file changed, 35 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index d82a4a2db699..703423b0be0e 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -493,33 +493,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>>> }
>>> }
>>>
>>> -static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>>> +static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
>>> {
>>> - int id = get_domain_id_from_scope(cpu, r->mon_scope);
>>> - struct list_head *add_pos = NULL;
>>> struct rdt_hw_mon_domain *hw_dom;
>>> - struct rdt_domain_hdr *hdr;
>>> struct rdt_mon_domain *d;
>>> int err;
>>>
>>> - lockdep_assert_held(&domain_list_lock);
>>> -
>>> - if (id < 0) {
>>> - pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
>>> - cpu, r->mon_scope, r->name);
>>> - return;
>>> - }
>>> -
>>> - hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
>>> - if (hdr) {
>>> - if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
>>> - return;
>>> - d = container_of(hdr, struct rdt_mon_domain, hdr);
>>> -
>>> - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>>> - return;
>>> - }
>>> -
>>
>> Many functions called by this new "setup_l3_mon_domain()" are specific to L3 resource but
>> since L3 resource has so far been the only one supported the naming has been generic. Now that
>> this function is made resource specific I think it will help make the code clear if the
>> L3 specific functions called by it are also renamed. For example, mon_domain_free() can
>> be renamed to free_l3_mon_domain() to match the "setup_l3_mon_domain()" introduced here. Also
>> arch_mon_domain_online() -> arch_l3_mon_domain_online().
>
> What about "struct rdt_mon_domain"? It is now specific to L3. Should I
> change that to "rdt_l3_mon_domain" as well (60 lines affected)?
>
> Ditto for rdt_hw_mon_domain (but only 12 lines for this one).
Thank you for considering this. My vote would be "yes" for both. I think it will help
understand and maintain the code if naming helps to make obvious what data/code applies to.
Monitoring has been synonymous with L3 monitoring for so long that there may be many
instances of this implicit assumption.
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 08/26] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (6 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 21:53 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 09/26] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
` (19 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Different types of domains require different actions when the last
CPU in the domain is removed.
Refactor to make it easy to add new actions for new types of domains.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 703423b0be0e..7080447028b0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -644,17 +644,19 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON)))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- hw_dom = resctrl_to_arch_mon_dom(d);
+ cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
+ return;
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
mon_domain_free(hw_dom);
-
- return;
+ break;
}
}
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 08/26] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-04-07 23:40 ` [PATCH v3 08/26] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-04-18 21:53 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:53 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Different types of domains require different actions when the last
Please expand/explain what is meant with "different types of domains".
> CPU in the domain is removed.
>
> Refactor to make it easy to add new actions for new types of domains.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 09/26] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (7 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 08/26] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 22:42 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
` (18 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Functions that don't need the internal details of rdt_mon_domain
can operate on just the rdt_domain_hdr.
Where container_of() is used to find the surrounding domain structure,
add sanity checks that hdr has the expected type.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 2 +-
fs/resctrl/rdtgroup.c | 66 +++++++++++++++++++++---------
3 files changed, 48 insertions(+), 22 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 177f9879bae1..0fce626605b9 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -412,7 +412,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
void resctrl_online_cpu(unsigned int cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7080447028b0..59844fd7105f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -523,7 +523,7 @@ static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct
list_add_tail_rcu(&d->hdr.list, add_pos);
- err = resctrl_online_mon_domain(r, d);
+ err = resctrl_online_mon_domain(r, &d->hdr);
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 1433fc098a90..5011e404798a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2990,7 +2990,7 @@ static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subn
* when last domain being summed is removed.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
char subname[32];
@@ -2998,9 +2998,16 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
char name[32];
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
- if (snc_mode)
- sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ if (snc_mode) {
+ struct rdt_mon_domain *d;
+
+ WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON));
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
+ } else {
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ }
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mon_rmdir_one_subdir(prgrp, name, subname);
@@ -3010,20 +3017,29 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
-static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp,
bool do_sum)
{
struct rmid_read rr = {0};
+ struct rdt_mon_domain *d;
struct mon_data *priv;
struct mon_evt *mevt;
+ int domid;
int ret;
if (WARN_ON(list_empty(&r->evt_list)))
return -EPERM;
+ if (r->rid == RDT_RESOURCE_L3) {
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ domid = do_sum ? d->ci->id : d->hdr.id;
+ } else {
+ domid = hdr->id;
+ }
+
list_for_each_entry(mevt, &r->evt_list, list) {
- priv = mon_get_kn_priv(r->rid, do_sum ? d->ci->id : d->hdr.id, mevt, do_sum);
+ priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
if (WARN_ON_ONCE(!priv))
return -EINVAL;
@@ -3031,7 +3047,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
if (ret)
return ret;
- if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
+ if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
}
@@ -3039,10 +3055,11 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
}
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_mon_domain *d,
+ struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
+ struct rdt_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
@@ -3050,7 +3067,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
lockdep_assert_held(&rdtgroup_mutex);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
+ if (snc_mode) {
+ WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON));
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ } else {
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ }
kn = kernfs_find_and_get(parent_kn, name);
if (kn) {
/*
@@ -3066,7 +3089,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
ret = rdtgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
- ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
+ ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
if (ret)
goto out_destroy;
}
@@ -3083,7 +3106,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (ret)
goto out_destroy;
- ret = mon_add_all_files(ckn, d, r, prgrp, false);
+ ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
if (ret)
goto out_destroy;
}
@@ -3101,7 +3124,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3109,12 +3132,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
parent_kn = crgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, crgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
}
}
}
@@ -3123,14 +3146,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_mon_domain *dom;
+ struct rdt_domain_hdr *hdr;
int ret;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(dom, &r->mon_domains, hdr.list) {
- ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
+ list_for_each_entry(hdr, &r->mon_domains, list) {
+ ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
if (ret)
return ret;
}
@@ -3991,7 +4014,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d);
+ rmdir_mondata_subdir_allrdtgrp(r, &d->hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
@@ -4074,10 +4097,13 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
return err;
}
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
+ struct rdt_mon_domain *d;
int err;
+ WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON));
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
mutex_lock(&rdtgroup_mutex);
err = domain_setup_mon_state(r, d);
@@ -4100,7 +4126,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
* If resctrl is mounted, add per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- mkdir_mondata_subdir_allrdtgrp(r, d);
+ mkdir_mondata_subdir_allrdtgrp(r, hdr);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 09/26] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr
2025-04-07 23:40 ` [PATCH v3 09/26] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
@ 2025-04-18 22:42 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 22:42 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> @@ -4074,10 +4097,13 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
> return err;
> }
>
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> + struct rdt_mon_domain *d;
> int err;
>
> + WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON));
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> mutex_lock(&rdtgroup_mutex);
It is a bit unexpected to see code added outside of the mutex_lock(). This looks
fine since it accesses the domain list from the hotplug handlers, but since this is fs
code that is jumped to by arch code, and fs code relies on arch code for the locking,
I'd like to suggest a:
lockdep_assert_cpus_held();
Reinette
* [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (8 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 09/26] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 22:54 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats Tony Luck
` (17 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Add a flag to each instance of struct mon_evt to indicate that there
is no need for cross-processor interrupts to read this event from a CPU
in a specific rdt_mon_domain.
The flag is copied to struct mon_data for ease of access when a user
reads an event file invoking rdtgroup_mondata_show().
Copied again into struct rmid_read in mon_event_read() for use by
sanity checks in __mon_event_count().
When the flag is set, allow choice from cpu_online_mask. This makes the
smp_call*() functions default to the current CPU.
Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 8 +++++++-
fs/resctrl/ctrlmondata.c | 10 +++++++---
fs/resctrl/monitor.c | 4 ++--
fs/resctrl/rdtgroup.c | 1 +
4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 08dbf89939ac..74a77794364d 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -72,6 +72,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @evtid: event id
* @name: name of the event
* @configurable: true if the event is configurable
+ * @any_cpu: true if this event can be read from any CPU
* @list: entry in &rdt_resource->evt_list
*/
struct mon_evt {
@@ -79,6 +80,7 @@ struct mon_evt {
enum resctrl_res_level rid;
char *name;
bool configurable;
+ bool any_cpu;
struct list_head list;
};
@@ -93,6 +95,7 @@ struct mon_evt {
* the event file belongs. When @sum is one this
* is the id of the L3 cache that all domains to be
* summed share.
+ * @any_cpu: true if this event can be read from any CPU
*
* Stored in the kernfs kn->priv field, readers and writers must hold
* rdtgroup_mutex.
@@ -103,6 +106,7 @@ struct mon_data {
enum resctrl_event_id evtid;
unsigned int sum;
unsigned int domid;
+ bool any_cpu;
};
/**
@@ -115,6 +119,7 @@ struct mon_data {
* domains in @r sharing L3 @ci.id
* @evtid: Which monitor event to read.
* @first: Initialize MBM counter when true.
+ * @any_cpu: When true read can be executed on any CPU.
* @ci: Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
* @err: Error encountered when reading counter.
* @val: Returned value of event counter. If @rgrp is a parent resource group,
@@ -129,6 +134,7 @@ struct rmid_read {
struct rdt_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
+ bool any_cpu;
struct cacheinfo *ci;
int err;
u64 val;
@@ -358,7 +364,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first);
+ const cpumask_t *cpumask, int evtid, int first);
int resctrl_mon_resource_init(void);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 0c245af0ff42..cd77960657f0 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -525,7 +525,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first)
+ const cpumask_t *cpumask, int evtid, int first)
{
int cpu;
@@ -571,6 +571,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
+ const cpumask_t *mask;
struct mon_data *md;
int ret = 0;
@@ -589,6 +590,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
resid = md->rid;
domid = md->domid;
evtid = md->evtid;
+ rr.any_cpu = md->any_cpu;
r = resctrl_arch_get_resource(resid);
if (md->sum) {
@@ -601,8 +603,9 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->ci->id == domid) {
rr.ci = d->ci;
+ mask = md->any_cpu ? cpu_online_mask : &d->ci->shared_cpu_map;
mon_event_read(&rr, r, NULL, rdtgrp,
- &d->ci->shared_cpu_map, evtid, false);
+ mask, evtid, false);
goto checkresult;
}
}
@@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
goto out;
}
d = container_of(hdr, struct rdt_mon_domain, hdr);
- mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
+ mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
+ mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
}
checkresult:
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 472754d082cb..1cf0b085e07a 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -375,7 +375,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
if (rr->d) {
/* Reading a single domain, must be on a CPU in that domain. */
- if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
+ if (!rr->any_cpu && !cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
return -EINVAL;
rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
rr->evtid, &tval, rr->arch_mon_ctx);
@@ -388,7 +388,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
}
/* Summing domains that share a cache, must be on a CPU for that cache. */
- if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
+ if (!rr->any_cpu && !cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
return -EINVAL;
/*
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 5011e404798a..97c2ba8af930 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2926,6 +2926,7 @@ static struct mon_data *mon_get_kn_priv(int rid, int domid, struct mon_evt *mevt
priv->domid = domid;
priv->sum = do_sum;
priv->evtid = mevt->evtid;
+ priv->any_cpu = mevt->any_cpu;
list_add_tail(&priv->list, &kn_priv_list);
return priv;
--
2.48.1
* Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-07 23:40 ` [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
@ 2025-04-18 22:54 ` Reinette Chatre
2025-04-21 20:28 ` Luck, Tony
0 siblings, 1 reply; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 22:54 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Add a flag to each instance of struct mon_evt to indicate that there
> is no need for cross-processor interrupts to read this event from a CPU
> in a specific rdt_mon_domain.
>
> The flag is copied to struct mon_data for ease of access when a user
Copy the flag ...
> reads an event file invoking rdtgroup_mondata_show().
>
> Copied again into struct rmid_read in mon_event_read() for use by
> sanity checks in __mon_event_count().
>
> When the flag is set allow choice from cpu_online_mask. This makes the
> smp_call*() functions default to the current CPU.
Please use imperative tone.
>
> Suggested-by: James Morse <james.morse@arm.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> fs/resctrl/internal.h | 8 +++++++-
> fs/resctrl/ctrlmondata.c | 10 +++++++---
> fs/resctrl/monitor.c | 4 ++--
> fs/resctrl/rdtgroup.c | 1 +
> 4 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 08dbf89939ac..74a77794364d 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -72,6 +72,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> * @evtid: event id
> * @name: name of the event
> * @configurable: true if the event is configurable
> + * @any_cpu: true if this event can be read from any CPU
> * @list: entry in &rdt_resource->evt_list
> */
> struct mon_evt {
> @@ -79,6 +80,7 @@ struct mon_evt {
> enum resctrl_res_level rid;
> char *name;
> bool configurable;
> + bool any_cpu;
> struct list_head list;
> };
>
> @@ -93,6 +95,7 @@ struct mon_evt {
> * the event file belongs. When @sum is one this
> * is the id of the L3 cache that all domains to be
> * summed share.
> + * @any_cpu: true if this event can be read from any CPU
> *
> * Stored in the kernfs kn->priv field, readers and writers must hold
> * rdtgroup_mutex.
> @@ -103,6 +106,7 @@ struct mon_data {
> enum resctrl_event_id evtid;
> unsigned int sum;
> unsigned int domid;
> + bool any_cpu;
> };
>
> /**
> @@ -115,6 +119,7 @@ struct mon_data {
> * domains in @r sharing L3 @ci.id
> * @evtid: Which monitor event to read.
> * @first: Initialize MBM counter when true.
> + * @any_cpu: When true read can be executed on any CPU.
> * @ci: Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
> * @err: Error encountered when reading counter.
> * @val: Returned value of event counter. If @rgrp is a parent resource group,
> @@ -129,6 +134,7 @@ struct rmid_read {
> struct rdt_mon_domain *d;
> enum resctrl_event_id evtid;
> bool first;
> + bool any_cpu;
> struct cacheinfo *ci;
> int err;
> u64 val;
Duplicating the same property across three structures does not look right. It looks to
me that struct mon_evt should be the "source of truth" for any event and that
these other structures can point to it instead of copying the data?
> @@ -358,7 +364,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg);
>
> void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> - cpumask_t *cpumask, int evtid, int first);
> + const cpumask_t *cpumask, int evtid, int first);
>
> int resctrl_mon_resource_init(void);
>
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index 0c245af0ff42..cd77960657f0 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -525,7 +525,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
>
> void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> - cpumask_t *cpumask, int evtid, int first)
> + const cpumask_t *cpumask, int evtid, int first)
> {
> int cpu;
>
> @@ -571,6 +571,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> u32 resid, evtid, domid;
> struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
> + const cpumask_t *mask;
> struct mon_data *md;
> int ret = 0;
>
> @@ -589,6 +590,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> resid = md->rid;
> domid = md->domid;
> evtid = md->evtid;
> + rr.any_cpu = md->any_cpu;
> r = resctrl_arch_get_resource(resid);
>
> if (md->sum) {
> @@ -601,8 +603,9 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> list_for_each_entry(d, &r->mon_domains, hdr.list) {
> if (d->ci->id == domid) {
> rr.ci = d->ci;
> + mask = md->any_cpu ? cpu_online_mask : &d->ci->shared_cpu_map;
> mon_event_read(&rr, r, NULL, rdtgrp,
> - &d->ci->shared_cpu_map, evtid, false);
> + mask, evtid, false);
> goto checkresult;
> }
> }
> @@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> goto out;
> }
> d = container_of(hdr, struct rdt_mon_domain, hdr);
> - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
> + mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
> + mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
I do not think this accomplishes the goal of this patch. Looking at mon_event_read() it calls
cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU) before any of the smp_*() calls.
cpumask_any_housekeeping()
{
...
if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
cpu = cpumask_any(mask);
...
}
cpumask_any() is just cpumask_first() so it will pick the first CPU in the
online mask that may not be the current CPU.
fwiw ... there are some optimizations planned in this area that I have not yet studied:
https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-18 22:54 ` Reinette Chatre
@ 2025-04-21 20:28 ` Luck, Tony
2025-04-22 18:19 ` Reinette Chatre
0 siblings, 1 reply; 67+ messages in thread
From: Luck, Tony @ 2025-04-21 20:28 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
On Fri, Apr 18, 2025 at 03:54:02PM -0700, Reinette Chatre wrote:
> > @@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> > goto out;
> > }
> > d = container_of(hdr, struct rdt_mon_domain, hdr);
> > - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
> > + mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
> > + mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
>
> I do not think this accomplishes the goal of this patch. Looking at mon_event_read() it calls
> cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU) before any of the smp_*() calls.
>
> cpumask_any_housekeeping()
> {
> ...
> if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
> cpu = cpumask_any(mask);
> ...
> }
>
> cpumask_any() is just cpumask_first() so it will pick the first CPU in the
> online mask that may not be the current CPU.
>
> fwiw ... there are some optimizations planned in this area that I have not yet studied:
> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
I remember Peter complaining[1] about extra context switches when
cpumask_any_housekeeping() was introduced, but it seems that the
discussion died with no fix applied.
The blocking problem is that ARM may not be able to read a counter
on a tick_nohz CPU because it may need to sleep.
Do we need more options for events:
1) Must be read on a CPU in the right domain // Legacy
2) Can be read from any CPU // My addition
3) Must be read on a "housekeeping" CPU // James' code in upstream
4) Cannot be read on a tick_nohz CPU // Could be combined with 1 or 2?
> Reinette
[1] https://lore.kernel.org/all/20241031142553.3963058-2-peternewman@google.com/
>
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-21 20:28 ` Luck, Tony
@ 2025-04-22 18:19 ` Reinette Chatre
2025-04-23 0:51 ` Luck, Tony
2025-04-23 13:27 ` Peter Newman
0 siblings, 2 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-22 18:19 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Tony,
On 4/21/25 1:28 PM, Luck, Tony wrote:
> On Fri, Apr 18, 2025 at 03:54:02PM -0700, Reinette Chatre wrote:
>>> @@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>>> goto out;
>>> }
>>> d = container_of(hdr, struct rdt_mon_domain, hdr);
>>> - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
>>> + mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
>>> + mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
>>
>> I do not think this accomplishes the goal of this patch. Looking at mon_event_read() it calls
>> cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU) before any of the smp_*() calls.
>>
>> cpumask_any_housekeeping()
>> {
>> ...
>> if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
>> cpu = cpumask_any(mask);
>> ...
>> }
>>
>> cpumask_any() is just cpumask_first() so it will pick the first CPU in the
>> online mask that may not be the current CPU.
>>
>> fwiw ... there are some optimizations planned in this area that I have not yet studied:
>> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
>
> I remember Peter complaining[1] about extra context switches when
> cpumask_any_housekeeping() was introduced, but it seems that the
> discussion died with no fix applied.
The initial complaint was indeed that reading individual events is slower.
The issue is that the intended use case read from many files at frequent
intervals and thus becomes vulnerable to any changes in this area that
really is already a slow path (reading from a file ... taking a mutex ...).
Instead of working on shaving cycles off this path the discussion transitioned
to resctrl providing better support for the underlying use case. I
understood that this is being experimented with [2] and last I heard it
looks promising.
>
> The blocking problem is that ARM may not be able to read a counter
> on a tick_nohz CPU because it may need to sleep.
>
> Do we need more options for events:
>
> 1) Must be read on a CPU in the right domain // Legacy
> > 2) Can be read from any CPU // My addition
> 3) Must be read on a "housekeeping" CPU // James' code in upstream
> 4) Cannot be read on a tick_nohz CPU // Could be combined with 1 or 2?
I do not see needing additional complexity here. I think it will be simpler
to just replace use of cpumask_any_housekeeping() in mon_event_read() with
open code that supports the particular usage. As I understand it, it is prohibited
for all CPUs to be in tick_nohz_full_mask, so it looks to me as though the
existing "if (tick_nohz_full_cpu(cpu))" should never be true (since no CPU is being excluded).
Also, since mon_event_read() has no need to exclude CPUs, just a cpumask_andnot()
should suffice to determine what remains of given mask after accounting for all the
NO_HZ CPUs if tick_nohz_full_enabled().
Reinette
>
>> Reinette
>
> [1] https://lore.kernel.org/all/20241031142553.3963058-2-peternewman@google.com/
>>
[2] https://lore.kernel.org/lkml/CALPaoCgpnVORZfbKVLXDFUZvv8jhpShHPzB3cwdLTZQH1o9ULw@mail.gmail.com/
* RE: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-22 18:19 ` Reinette Chatre
@ 2025-04-23 0:51 ` Luck, Tony
2025-04-23 3:37 ` Reinette Chatre
2025-04-23 13:27 ` Peter Newman
1 sibling, 1 reply; 67+ messages in thread
From: Luck, Tony @ 2025-04-23 0:51 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Keshavamurthy, Anil S,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
> >> cpumask_any() is just cpumask_first() so it will pick the first CPU in the
> >> online mask that may not be the current CPU.
> >>
> >> fwiw ... there are some optimizations planned in this area that I have not yet studied:
> >> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
> >
> > I remember Peter complaining[1] about extra context switches when
> > cpumask_any_housekeeping() was introduced, but it seems that the
> > discussion died with no fix applied.
>
> The initial complaint was indeed that reading individual events is slower.
>
> The issue is that the intended use case read from many files at frequent
> intervals and thus becomes vulnerable to any changes in this area that
> really is already a slow path (reading from a file ... taking a mutex ...).
>
> Instead of working on shaving cycles off this path the discussion transitioned
> to resctrl providing better support for the underlying use case. I
> understood that this is being experimented with [2] and last I heard it
> looks promising.
>
> >
> > The blocking problem is that ARM may not be able to read a counter
> > on a tick_nohz CPU because it may need to sleep.
> >
> > Do we need more options for events:
> >
> > 1) Must be read on a CPU in the right domain // Legacy
> > 2) Can be read from any CPU // My addition
> > 3) Must be read on a "housekeeping" CPU // James' code in upstream
> > 4) Cannot be read on a tick_nohz CPU // Could be combined with 1 or 2?
>
> I do not see needing additional complexity here. I think it will be simpler
> to just replace use of cpumask_any_housekeeping() in mon_event_read() with
> open code that supports the particular usage. As I understand it is prohibited
> for all CPUs to be in tick_nohz_full_mask so it looks to me as though the
> existing "if (tick_nohz_full_cpu(cpu))" should never be true (since no CPU is being excluded).
> Also, since mon_event_read() has no need to exclude CPUs, just a cpumask_andnot()
> should suffice to determine what remains of given mask after accounting for all the
> NO_HZ CPUs if tick_nohz_full_enabled().
Maybe there isn’t much complexity to make this "read one counter" better on systems
where reading from any CPU is possible. Taking your advice from the earlier review,
the filesystem code can set a flag in the mon_evt structure. struct mon_data and
struct rmid_read can change from holding the event id to holding a pointer to the
mon_evt (as the source of truth).
Then mon_event_read() can just have a simple direct call to mon_event_count()
just before the call to cpumask_any_housekeeping() like this:
if (evt->any_cpu) {
mon_event_count(rr);
goto done;
}
The "goto done" jumps to the resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
at the end of mon_event_read()
Folks can still pursue the bulk read of many counters (though I expect you might want
one file per domain, rather than a single file to report everything).
>
> Reinette
>
> >
> >> Reinette
> >
> > [1] https://lore.kernel.org/all/20241031142553.3963058-2-peternewman@google.com/
> >>
>
> [2] https://lore.kernel.org/lkml/CALPaoCgpnVORZfbKVLXDFUZvv8jhpShHPzB3cwdLTZQH1o9ULw@mail.gmail.com/
-Tony
* Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-23 0:51 ` Luck, Tony
@ 2025-04-23 3:37 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-23 3:37 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Keshavamurthy, Anil S,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
Hi Tony,
On 4/22/25 5:51 PM, Luck, Tony wrote:
>>>> cpumask_any() is just cpumask_first() so it will pick the first CPU in the
>>>> online mask that may not be the current CPU.
>>>>
>>>> fwiw ... there are some optimizations planned in this area that I have not yet studied:
>>>> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
>>>
>>> I remember Peter complaining[1] about extra context switches when
>>> cpumask_any_housekeeping() was introduced, but it seems that the
>>> discussion died with no fix applied.
>>
>> The initial complaint was indeed that reading individual events is slower.
>>
>> The issue is that the intended use case read from many files at frequent
>> intervals and thus becomes vulnerable to any changes in this area that
>> really is already a slow path (reading from a file ... taking a mutex ...).
>>
>> Instead of working on shaving cycles off this path the discussion transitioned
>> to resctrl providing better support for the underlying use case. I
>> understood that this is being experimented with [2] and last I heard it
>> looks promising.
>>
>>>
>>> The blocking problem is that ARM may not be able to read a counter
>>> on a tick_nohz CPU because it may need to sleep.
>>>
>>> Do we need more options for events:
>>>
>>> 1) Must be read on a CPU in the right domain // Legacy
>>> 2) Can be read from any CPU // My addition
>>> 3) Must be read on a "housekeeping" CPU // James' code in upstream
>>> 4) Cannot be read on a tick_nohz CPU // Could be combined with 1 or 2?
>>
>> I do not see needing additional complexity here. I think it will be simpler
>> to just replace use of cpumask_any_housekeeping() in mon_event_read() with
>> open code that supports the particular usage. As I understand it is prohibited
>> for all CPUs to be in tick_nohz_full_mask so it looks to me as though the
>> existing "if (tick_nohz_full_cpu(cpu))" should never be true (since no CPU is being excluded).
>> Also, since mon_event_read() has no need to exclude CPUs, just a cpumask_andnot()
>> should suffice to determine what remains of given mask after accounting for all the
>> NO_HZ CPUs if tick_nohz_full_enabled().
>
> Maybe there isn’t much complexity to make this "read one counter" better on systems
> where reading from any CPU is possible. Taking your advice from the earlier review
> the filesystem code can set a flag in the mon_evt structure. struct mon_data and
> struct rmid_read can change from holding the event id to holding a pointer to the
> mon_evt (as the source of truth).
>
> Then mon_event_read() can just have a simple direct call to mon_event_count()
> just before the call to cpumask_any_housekeeping() like this:
>
> if (evt->any_cpu) {
> mon_event_count(rr);
> goto done;
> }
>
> The "goto done" jumps to the resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
> at the end of mon_event_read()
Thanks, this looks great.
>
> Folks can still pursue the bulk read of many counters (though I expect you might want
> one file per domain, rather than a single file to report everything).
I will have to re-read that thread to refresh myself on the discussion, but scanning the thread
I did find a summary of points (end of [3]) and this was there.
>
>>
>> Reinette
>>
>>>
>>>> Reinette
>>>
>>> [1] https://lore.kernel.org/all/20241031142553.3963058-2-peternewman@google.com/
>>>>
>>
>> [2] https://lore.kernel.org/lkml/CALPaoCgpnVORZfbKVLXDFUZvv8jhpShHPzB3cwdLTZQH1o9ULw@mail.gmail.com/
>
[3] https://lore.kernel.org/lkml/34fd8713-3430-4e27-a2c2-fd8839f90f5a@intel.com/
Reinette
* Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-22 18:19 ` Reinette Chatre
2025-04-23 0:51 ` Luck, Tony
@ 2025-04-23 13:27 ` Peter Newman
2025-04-23 15:47 ` Reinette Chatre
1 sibling, 1 reply; 67+ messages in thread
From: Peter Newman @ 2025-04-23 13:27 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Reinette,
On Tue, Apr 22, 2025 at 8:20 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Tony,
>
> On 4/21/25 1:28 PM, Luck, Tony wrote:
> > On Fri, Apr 18, 2025 at 03:54:02PM -0700, Reinette Chatre wrote:
> >>> @@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> >>> goto out;
> >>> }
> >>> d = container_of(hdr, struct rdt_mon_domain, hdr);
> >>> - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
> >>> + mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
> >>> + mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
> >>
> >> I do not think this accomplishes the goal of this patch. Looking at mon_event_read() it calls
> >> cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU) before any of the smp_*() calls.
> >>
> >> cpumask_any_housekeeping()
> >> {
> >> ...
> >> if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
> >> cpu = cpumask_any(mask);
> >> ...
> >> }
> >>
> >> cpumask_any() is just cpumask_first() so it will pick the first CPU in the
> >> online mask that may not be the current CPU.
> >>
> >> fwiw ... there are some optimizations planned in this area that I have not yet studied:
> >> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
> >
> > I remember Peter complaining[1] about extra context switches when
> > cpumask_any_housekeeping() was introduced, but it seems that the
> > discussion died with no fix applied.
>
> The initial complaint was indeed that reading individual events is slower.
>
> The issue is that the intended use case read from many files at frequent
> intervals and thus becomes vulnerable to any changes in this area that
> really is already a slow path (reading from a file ... taking a mutex ...).
>
> Instead of working on shaving cycles off this path the discussion transitioned
> to resctrl providing better support for the underlying use case. I
> understood that this is being experimented with [2] and last I heard it
> looks promising.
>
> >
> > The blocking problem is that ARM may not be able to read a counter
> > on a tick_nohz CPU because it may need to sleep.
If I hadn't already turned my attention to optimizing bulk counter
reads, I might have mentioned that the change Tony referred to is
broken on MPAM implementations because the MPAM
resctrl_arch_rmid_read() cannot wait for its internal mutex with
preemption disabled.
> >
> > Do we need more options for events:
> >
> > 1) Must be read on a CPU in the right domain // Legacy
> > > 2) Can be read from any CPU // My addition
> > 3) Must be read on a "housekeeping" CPU // James' code in upstream
> > 4) Cannot be read on a tick_nohz CPU // Could be combined with 1 or 2?
>
> I do not see needing additional complexity here. I think it will be simpler
> to just replace use of cpumask_any_housekeeping() in mon_event_read() with
> open code that supports the particular usage. As I understand it is prohibited
> for all CPUs to be in tick_nohz_full_mask so it looks to me as though the
> existing "if (tick_nohz_full_cpu(cpu))" should never be true (since no CPU is being excluded).
> Also, since mon_event_read() has no need to exclude CPUs, just a cpumask_andnot()
> should suffice to determine what remains of given mask after accounting for all the
> NO_HZ CPUs if tick_nohz_full_enabled().
Can you clarify what you mean by "all CPUs"? It's not difficult for
all CPUs in an L3 domain to be in tick_nohz_full_mask on AMD
implementations, where there are many small L3 domains (~8 CPUs each)
in a socket.
Google makes use of isolation along this domain boundary on AMD
platforms in some products and these users prefer to read counters
using IPIs because they are concerned about introducing context
switches to the isolated part of the system. In these configurations,
there is typically only one RMID in that domain, so few of these IPIs
are needed. (Note that these are different users from the ones I had
described before who spawn large numbers of containers not limited to
any domains and want to read the MBM counters for all the RMIDs on all
the domains frequently.)
Thanks,
-Peter
* Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-23 13:27 ` Peter Newman
@ 2025-04-23 15:47 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-23 15:47 UTC (permalink / raw)
To: Peter Newman
Cc: Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Peter,
On 4/23/25 6:27 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Tue, Apr 22, 2025 at 8:20 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Tony,
>>
>> On 4/21/25 1:28 PM, Luck, Tony wrote:
>>> On Fri, Apr 18, 2025 at 03:54:02PM -0700, Reinette Chatre wrote:
>>>>> @@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>>>>> goto out;
>>>>> }
>>>>> d = container_of(hdr, struct rdt_mon_domain, hdr);
>>>>> - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
>>>>> + mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
>>>>> + mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
>>>>
>>>> I do not think this accomplishes the goal of this patch. Looking at mon_event_read() it calls
>>>> cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU) before any of the smp_*() calls.
>>>>
>>>> cpumask_any_housekeeping()
>>>> {
>>>> ...
>>>> if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
>>>> cpu = cpumask_any(mask);
>>>> ...
>>>> }
>>>>
>>>> cpumask_any() is just cpumask_first() so it will pick the first CPU in the
>>>> online mask that may not be the current CPU.
>>>>
>>>> fwiw ... there are some optimizations planned in this area that I have not yet studied:
>>>> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
>>>
>>> I remember Peter complaining[1] about extra context switches when
>>> cpumask_any_housekeeping() was introduced, but it seems that the
>>> discussion died with no fix applied.
>>
>> The initial complaint was indeed that reading individual events is slower.
>>
>> The issue is that the intended use case read from many files at frequent
>> intervals and thus becomes vulnerable to any changes in this area that
>> really is already a slow path (reading from a file ... taking a mutex ...).
>>
>> Instead of working on shaving cycles off this path the discussion transitioned
>> to resctrl providing better support for the underlying use case. I
>> understood that this is being experimented with [2] and last I heard it
>> looks promising.
>>
>>>
>>> The blocking problem is that ARM may not be able to read a counter
>>> on a tick_nohz CPU because it may need to sleep.
>
> If I hadn't already turned my attention to optimizing bulk counter
> reads, I might have mentioned that the change Tony referred to is
> broken on MPAM implementations because the MPAM
> resctrl_arch_rmid_read() cannot wait for its internal mutex with
> preemption disabled.
>
>>>
>>> Do we need more options for events:
>>>
>>> 1) Must be read on a CPU in the right domain // Legacy
>>> 2) Can be read from any CPU // My addtion
>>> 3) Must be read on a "housekeeping" CPU // James' code in upstream
>>> 4) Cannot be read on a tick_nohz CPU // Could be combined with 1 or 2?
>>
>> I do not see needing additional complexity here. I think it will be simpler
>> to just replace use of cpumask_any_housekeeping() in mon_event_read() with
>> open code that supports the particular usage. As I understand it, it is prohibited
>> for all CPUs to be in tick_nohz_full_mask, so it looks to me as though the
>> existing "if (tick_nohz_full_cpu(cpu))" should never be true (since no CPU is being excluded).
>> Also, since mon_event_read() has no need to exclude CPUs, just a cpumask_andnot()
>> should suffice to determine what remains of the given mask after accounting for all the
>> NO_HZ CPUs if tick_nohz_full_enabled().
>
> Can you clarify what you mean by "all CPUs"? It's not difficult for
I mentioned this in the context of this patch, which adds support for
events that can be read from *any* CPU. The CPU reading the event data
need not be in the domain for which data is being read so all CPUs
on the system are available to the flow supporting these events. Since
all CPUs on the system cannot be in tick_nohz_full_mask there will always
be a CPU available to read this type of event that can be read from any CPU.
I made it way too complicated with this though. Tony proposed something
much better and simpler [1].
> all CPUs in an L3 domain to be in tick_nohz_full_mask on AMD
> implementations, where there are many small L3 domains (~8 CPUs each)
> in a socket.
>
> Google makes use of isolation along this domain boundary on AMD
> platforms in some products and these users prefer to read counters
> using IPIs because they are concerned about introducing context
> switches to the isolated part of the system. In these configurations,
> there is typically only one RMID in that domain, so few of these IPIs
> are needed. (Note that these are different users from the ones I had
> described before who spawn large numbers of containers not limited to
> any domains and want to read the MBM counters for all the RMIDs on all
> the domains frequently.)
>
Thank you for this insight. There is no change planned for reading
event counters for those events that need to be read from their
domain. Tony's recent proposal [1] moves the handling of this new
style of events to a separate branch.
Reinette
[1] https://lore.kernel.org/lkml/DS7PR11MB607763D8B912A60A3574D2BAFCBA2@DS7PR11MB6077.namprd11.prod.outlook.com/
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (9 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 23:02 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes Tony Luck
` (16 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Add a type field to both the mon_evt and mon_data structures.
Legacy monitor events are still all displayed as an unsigned decimal
64-bit integer.
Add an additional format of fixed-point with 18 binary places displayed
as a floating point value with six decimal places.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl_types.h | 8 ++++++++
fs/resctrl/internal.h | 4 ++++
fs/resctrl/ctrlmondata.c | 23 ++++++++++++++++++++++-
fs/resctrl/monitor.c | 3 +++
fs/resctrl/rdtgroup.c | 1 +
5 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index 898068a99ef7..fbd4b55c41aa 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -58,6 +58,14 @@ enum resctrl_event_id {
#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
+/*
+ * Event value display types
+ */
+enum resctrl_event_type {
+ EVT_TYPE_U64,
+ EVT_TYPE_U46_18,
+};
+
static inline bool resctrl_is_mbm_event(int e)
{
return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 74a77794364d..4a840e683e96 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -71,6 +71,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* struct mon_evt - Entry in the event list of a resource
* @evtid: event id
* @name: name of the event
+ * @type: format for display to user
* @configurable: true if the event is configurable
* @any_cpu: true if this event can be read from any CPU
* @list: entry in &rdt_resource->evt_list
@@ -79,6 +80,7 @@ struct mon_evt {
enum resctrl_event_id evtid;
enum resctrl_res_level rid;
char *name;
+ enum resctrl_event_type type;
bool configurable;
bool any_cpu;
struct list_head list;
@@ -89,6 +91,7 @@ struct mon_evt {
* @list: List of all allocated structures.
* @rid: Resource id associated with the event file.
* @evtid: Event id associated with the event file.
+ * @type: Format for display to user
* @sum: Set when event must be summed across multiple
* domains.
* @domid: When @sum is zero this is the domain to which
@@ -104,6 +107,7 @@ struct mon_data {
struct list_head list;
unsigned int rid;
enum resctrl_event_id evtid;
+ enum resctrl_event_type type;
unsigned int sum;
unsigned int domid;
bool any_cpu;
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index cd77960657f0..5ea8113c96ac 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -562,6 +562,27 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
}
+#define NUM_FRAC_BITS 18
+#define FRAC_MASK GENMASK(NUM_FRAC_BITS - 1, 0)
+
+static void show_value(struct seq_file *m, enum resctrl_event_type type, u64 val)
+{
+ u64 frac;
+
+ switch (type) {
+ case EVT_TYPE_U64:
+ seq_printf(m, "%llu\n", val);
+ break;
+ case EVT_TYPE_U46_18:
+ frac = val & FRAC_MASK;
+ frac = frac * 1000000;
+ frac += 1ul << (NUM_FRAC_BITS - 1);
+ frac >>= NUM_FRAC_BITS;
+ seq_printf(m, "%llu.%06llu\n", val >> NUM_FRAC_BITS, frac);
+ break;
+ }
+}
+
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
@@ -633,7 +654,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
else if (rr.err == -EINVAL)
seq_puts(m, "Unavailable\n");
else
- seq_printf(m, "%llu\n", rr.val);
+ show_value(m, md->type, rr.val);
out:
rdtgroup_kn_unlock(of->kn);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 1cf0b085e07a..1efad57d1d85 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -847,16 +847,19 @@ static struct mon_evt all_events[QOS_NUM_EVENTS] = {
.name = "llc_occupancy",
.evtid = QOS_L3_OCCUP_EVENT_ID,
.rid = RDT_RESOURCE_L3,
+ .type = EVT_TYPE_U64,
},
[QOS_L3_MBM_TOTAL_EVENT_ID] = {
.name = "mbm_total_bytes",
.evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
.rid = RDT_RESOURCE_L3,
+ .type = EVT_TYPE_U64,
},
[QOS_L3_MBM_LOCAL_EVENT_ID] = {
.name = "mbm_local_bytes",
.evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
.rid = RDT_RESOURCE_L3,
+ .type = EVT_TYPE_U64,
},
};
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 97c2ba8af930..bd41f7a0f416 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2927,6 +2927,7 @@ static struct mon_data *mon_get_kn_priv(int rid, int domid, struct mon_evt *mevt
priv->sum = do_sum;
priv->evtid = mevt->evtid;
priv->any_cpu = mevt->any_cpu;
+ priv->type = mevt->type;
list_add_tail(&priv->list, &kn_priv_list);
return priv;
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats
2025-04-07 23:40 ` [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats Tony Luck
@ 2025-04-18 23:02 ` Reinette Chatre
2025-04-21 19:34 ` Luck, Tony
0 siblings, 1 reply; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 23:02 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
Please add context.
On 4/7/25 4:40 PM, Tony Luck wrote:
> Add a type field to both the mon_evt and mon_data structures.
This is getting too much? How about mon_data pointing to mon_evt?
>
> Legacy monitor events are still all displayed as an unsigned decimal
> 64-bit integer.
>
> Add an additional format of fixed-point with 18 binary places displayed
> as a floating point value with six decimal places.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl_types.h | 8 ++++++++
> fs/resctrl/internal.h | 4 ++++
> fs/resctrl/ctrlmondata.c | 23 ++++++++++++++++++++++-
> fs/resctrl/monitor.c | 3 +++
> fs/resctrl/rdtgroup.c | 1 +
> 5 files changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index 898068a99ef7..fbd4b55c41aa 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
Why is this exposed to architecture? There are no users in arch code.
changelog gives no hints on how to interpret this.
> @@ -58,6 +58,14 @@ enum resctrl_event_id {
> #define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> #define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
>
> +/*
> + * Event value display types
How about "Event value display formats"?
> + */
> +enum resctrl_event_type {
resctrl_event_format/resctrl_event_fmt?
> + EVT_TYPE_U64,
EVT_FMT_U64/EVT_FORMAT_U64?
> + EVT_TYPE_U46_18,
> +};
> +
> static inline bool resctrl_is_mbm_event(int e)
> {
> return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 74a77794364d..4a840e683e96 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -71,6 +71,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> * struct mon_evt - Entry in the event list of a resource
> * @evtid: event id
> * @name: name of the event
> + * @type: format for display to user
> * @configurable: true if the event is configurable
> * @any_cpu: true if this event can be read from any CPU
> * @list: entry in &rdt_resource->evt_list
> @@ -79,6 +80,7 @@ struct mon_evt {
> enum resctrl_event_id evtid;
> enum resctrl_res_level rid;
> char *name;
> + enum resctrl_event_type type;
> bool configurable;
> bool any_cpu;
> struct list_head list;
> @@ -89,6 +91,7 @@ struct mon_evt {
> * @list: List of all allocated structures.
> * @rid: Resource id associated with the event file.
> * @evtid: Event id associated with the event file.
> + * @type: Format for display to user
> * @sum: Set when event must be summed across multiple
> * domains.
> * @domid: When @sum is zero this is the domain to which
> @@ -104,6 +107,7 @@ struct mon_data {
> struct list_head list;
> unsigned int rid;
> enum resctrl_event_id evtid;
> + enum resctrl_event_type type;
> unsigned int sum;
> unsigned int domid;
> bool any_cpu;
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index cd77960657f0..5ea8113c96ac 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -562,6 +562,27 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
> }
>
> +#define NUM_FRAC_BITS 18
> +#define FRAC_MASK GENMASK(NUM_FRAC_BITS - 1, 0)
> +
> +static void show_value(struct seq_file *m, enum resctrl_event_type type, u64 val)
show_value() is a bit vague ... print_event_value() ?
> +{
> + u64 frac;
> +
> + switch (type) {
> + case EVT_TYPE_U64:
> + seq_printf(m, "%llu\n", val);
> + break;
> + case EVT_TYPE_U46_18:
> + frac = val & FRAC_MASK;
> + frac = frac * 1000000;
> + frac += 1ul << (NUM_FRAC_BITS - 1);
Could you please help me understand why above line is needed? Seems like
shift below will just undo it?
> + frac >>= NUM_FRAC_BITS;
> + seq_printf(m, "%llu.%06llu\n", val >> NUM_FRAC_BITS, frac);
> + break;
> + }
> +}
> +
> int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> {
> struct kernfs_open_file *of = m->private;
> @@ -633,7 +654,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> else if (rr.err == -EINVAL)
> seq_puts(m, "Unavailable\n");
> else
> - seq_printf(m, "%llu\n", rr.val);
> + show_value(m, md->type, rr.val);
>
> out:
> rdtgroup_kn_unlock(of->kn);
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 1cf0b085e07a..1efad57d1d85 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -847,16 +847,19 @@ static struct mon_evt all_events[QOS_NUM_EVENTS] = {
> .name = "llc_occupancy",
> .evtid = QOS_L3_OCCUP_EVENT_ID,
> .rid = RDT_RESOURCE_L3,
> + .type = EVT_TYPE_U64,
> },
> [QOS_L3_MBM_TOTAL_EVENT_ID] = {
> .name = "mbm_total_bytes",
> .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
> .rid = RDT_RESOURCE_L3,
> + .type = EVT_TYPE_U64,
> },
> [QOS_L3_MBM_LOCAL_EVENT_ID] = {
> .name = "mbm_local_bytes",
> .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
> .rid = RDT_RESOURCE_L3,
> + .type = EVT_TYPE_U64,
> },
> };
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 97c2ba8af930..bd41f7a0f416 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -2927,6 +2927,7 @@ static struct mon_data *mon_get_kn_priv(int rid, int domid, struct mon_evt *mevt
> priv->sum = do_sum;
> priv->evtid = mevt->evtid;
> priv->any_cpu = mevt->any_cpu;
> + priv->type = mevt->type;
> list_add_tail(&priv->list, &kn_priv_list);
>
> return priv;
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats
2025-04-18 23:02 ` Reinette Chatre
@ 2025-04-21 19:34 ` Luck, Tony
2025-04-22 18:20 ` Reinette Chatre
0 siblings, 1 reply; 67+ messages in thread
From: Luck, Tony @ 2025-04-21 19:34 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
On Fri, Apr 18, 2025 at 04:02:08PM -0700, Reinette Chatre wrote:
> > + case EVT_TYPE_U46_18:
> > + frac = val & FRAC_MASK;
> > + frac = frac * 1000000;
> > + frac += 1ul << (NUM_FRAC_BITS - 1);
>
> Could you please help me understand why above line is needed? Seems like
> shift below will just undo it?
>
> > + frac >>= NUM_FRAC_BITS;
> > + seq_printf(m, "%llu.%06llu\n", val >> NUM_FRAC_BITS, frac);
> > + break;
> > + }
The extra addition is to round the value to the nearest supported decimal
value.
E.g. take the case where val == 1. This is a 1 in the 18th binary
place, so the precise decimal representation is:
1 / 2^18 = 0.000003814697265625
rounding to six decimal places should give: 0.000004
If you run the above code without the small addition you get:
frac = val & FRAC_MASK; // frac == 1
frac = frac * 1000000; // frac == 1000000
frac >>= NUM_FRAC_BITS; // frac == 3
So the output will be the truncated, not rounded, 0.000003
The addition will have a "carry bit" into the upper bits.
That isn't lost when shifting right. The value added is
as if there was a "1" in the 19th binary place.
-Tony
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats
2025-04-21 19:34 ` Luck, Tony
@ 2025-04-22 18:20 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-22 18:20 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Tony,
On 4/21/25 12:34 PM, Luck, Tony wrote:
> On Fri, Apr 18, 2025 at 04:02:08PM -0700, Reinette Chatre wrote:
>>> + case EVT_TYPE_U46_18:
>>> + frac = val & FRAC_MASK;
>>> + frac = frac * 1000000;
>>> + frac += 1ul << (NUM_FRAC_BITS - 1);
>>
>> Could you please help me understand why above line is needed? Seems like
>> shift below will just undo it?
>>
>>> + frac >>= NUM_FRAC_BITS;
>>> + seq_printf(m, "%llu.%06llu\n", val >> NUM_FRAC_BITS, frac);
>>> + break;
>>> + }
>
> The extra addition is to round the value to the nearest supported decimal
> value.
>
> E.g. take the case where val == 1. This is a 1 in the 18th binary
> place, so the precise decimal representation is:
>
> 1 / 2^18 = 0.000003814697265625
>
> rounding to six decimal places should give: 0.000004
>
> If you run the above code without the small addition you get:
>
> frac = val & FRAC_MASK; // frac == 1
> frac = frac * 1000000; // frac == 1000000
> frac >>= NUM_FRAC_BITS; // frac == 3
>
> So the output will be the truncated, not rounded, 0.000003
>
> The addition will have a "carry bit" into the upper bits.
> That isn't lost when shifting right. The value added is
> as if there was a "1" in the 19th binary place.
>
Thank you very much.
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (10 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 23:11 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 13/26] fs/resctrl: Add an architectural hook called for each mount Tony Luck
` (15 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Architecture code knows whether an event can be read from any CPU, or
from a CPU on a specific domain. It also knows what format to use
when printing each event value.
Add a hook to set mon_event.any_cpu and mon_event.type.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 3 +++
fs/resctrl/monitor.c | 12 ++++++++++++
2 files changed, 15 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0fce626605b9..8ac77b738de5 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -418,6 +418,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
+int resctrl_set_event_attributes(enum resctrl_event_id evt,
+ enum resctrl_event_type type, bool any_cpu);
+
/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 1efad57d1d85..5846a13c631a 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -863,6 +863,18 @@ static struct mon_evt all_events[QOS_NUM_EVENTS] = {
},
};
+int resctrl_set_event_attributes(enum resctrl_event_id evt,
+ enum resctrl_event_type type, bool any_cpu)
+{
+ if (evt >= QOS_NUM_EVENTS)
+ return -ENOENT;
+
+ all_events[evt].type = type;
+ all_events[evt].any_cpu = any_cpu;
+
+ return 0;
+}
+
int rdt_lookup_evtid_by_name(char *name)
{
int evt;
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes
2025-04-07 23:40 ` [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes Tony Luck
@ 2025-04-18 23:11 ` Reinette Chatre
2025-04-21 19:50 ` Luck, Tony
0 siblings, 1 reply; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 23:11 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Architecture code knows whether an event can be read from any CPU, or
> from a CPU on a specific domain. It also knows what format to use
> when printing each event value.
>
> Add a hook to set mon_event.any_cpu and mon_event.type.
If the architecture modifies the output format then the values exposed
to user space will look different between architectures. User space will
need to know how to parse the data. We do not want user space to need to
know which architecture it is running on to determine how to interact with
resctrl so this makes me think that this change needs to be accompanied
with a change that exposes the event format to user space.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 3 +++
> fs/resctrl/monitor.c | 12 ++++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 0fce626605b9..8ac77b738de5 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -418,6 +418,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> void resctrl_online_cpu(unsigned int cpu);
> void resctrl_offline_cpu(unsigned int cpu);
>
> +int resctrl_set_event_attributes(enum resctrl_event_id evt,
> + enum resctrl_event_type type, bool any_cpu);
> +
> /**
> * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> * for this resource and domain.
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 1efad57d1d85..5846a13c631a 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -863,6 +863,18 @@ static struct mon_evt all_events[QOS_NUM_EVENTS] = {
> },
> };
>
> +int resctrl_set_event_attributes(enum resctrl_event_id evt,
> + enum resctrl_event_type type, bool any_cpu)
So this is not actually a hook (in the resctrl_arch sense) but a direct interface
for arch code to change resctrl fs internals ... this needs to be done with care.
> +{
> + if (evt >= QOS_NUM_EVENTS)
> + return -ENOENT;
> +
> + all_events[evt].type = type;
> + all_events[evt].any_cpu = any_cpu;
> +
> + return 0;
> +}
> +
> int rdt_lookup_evtid_by_name(char *name)
> {
> int evt;
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes
2025-04-18 23:11 ` Reinette Chatre
@ 2025-04-21 19:50 ` Luck, Tony
2025-04-22 18:20 ` Reinette Chatre
0 siblings, 1 reply; 67+ messages in thread
From: Luck, Tony @ 2025-04-21 19:50 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
On Fri, Apr 18, 2025 at 04:11:22PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 4/7/25 4:40 PM, Tony Luck wrote:
> > Architecture code knows whether an event can be read from any CPU, or
> > from a CPU on a specific domain. It also knows what format to use
> > when printing each event value.
> >
> > Add a hook to set mon_event.any_cpu and mon_event.type.
>
> If the architecture modifies the output format then the values exposed
> to user space will look different between architectures. User space will
> need to know how to parse the data. We do not want user space to need to
> know which architecture it is running on to determine how to interact with
> user space so this makes me think that this change needs to be accompanied
> with a change that exposes the event format to user space.
Would it be enough to include this in Documentation? I.e. add specific
entries for "core_energy" to say that it is reported as a floating point
value with unit Joules, and "activity" is reported as a floating point
value with unit of Farads.
Alternatively the filesystem code could convert the fixed point values
to integer values of micro-Joules and micro-Farads. Then the filesystem
code can print with "%llu" just like every other event. Would still need
the Documentation entries to explain the units. This has the limitation
that some future implementation that measures in greater precision would
be forced to round to nearest micro-{Joule,Farad}.
File system code controls the formatting options available, so options
for architecture code to break the user interface are limited.
-Tony
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes
2025-04-21 19:50 ` Luck, Tony
@ 2025-04-22 18:20 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-22 18:20 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Tony,
On 4/21/25 12:50 PM, Luck, Tony wrote:
> On Fri, Apr 18, 2025 at 04:11:22PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 4/7/25 4:40 PM, Tony Luck wrote:
>>> Architecture code knows whether an event can be read from any CPU, or
>>> from a CPU on a specific domain. It also knows what format to use
>>> when printing each event value.
>>>
>>> Add a hook to set mon_event.any_cpu and mon_event.type.
>>
>> If the architecture modifies the output format then the values exposed
>> to user space will look different between architectures. User space will
>> need to know how to parse the data. We do not want user space to need to
>> know which architecture it is running on to determine how to interact with
>> user space so this makes me think that this change needs to be accompanied
>> with a change that exposes the event format to user space.
>
> Would it be enough to include this in Documentation? I.e. add specific
> entries for "core_energy" to say that it is reported as a floating point
> value with unit Joules, and "activity" is reported as a floating point
> value with unit of Farads.
If it is always a floating point, yes, but this patch enables an
architecture to modify output to be different.
>
> Alternatively the filesystem code could convert the fixed point values
> to integer values of micro-Joules and micro-Farads. Then the filesystem
> code can print with "%llu" just like every other event. Would still need
> the Documentation entries to explain the units. This has the limitation
> that some future implementation that measures in greater precision would
> be forced to round to nearest micro-{Joule,Farad}.
I do not think we are talking about the same issue.
>
> File system code controls the formating options available, so options
> for architecture code to break the user interface are limited.
File system gives away the control with this patch, no?
This patch enables the architecture to change the output format. On
one architecture the event may thus be presented as floating point while
on another architecture the event will not be floating point.
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 13/26] fs/resctrl: Add an architectural hook called for each mount
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (11 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-18 23:47 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 14/26] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
` (14 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Enumeration of Intel telemetry events is not complete when the
resctrl "late_init" code is executed.
Add a hook at the beginning of the mount code that will be used
to check for telemetry events and initialize if any are found.
The hook is called on every mount. But expectations are that
most actions (like enumeration) will only need to be performed
on the first call.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 3 +++
arch/x86/kernel/cpu/resctrl/core.c | 9 +++++++++
fs/resctrl/rdtgroup.c | 2 ++
3 files changed, 14 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8ac77b738de5..25f51a57b0b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -421,6 +421,9 @@ void resctrl_offline_cpu(unsigned int cpu);
int resctrl_set_event_attributes(enum resctrl_event_id evt,
enum resctrl_event_type type, bool any_cpu);
+/* Architecture hook called for each file system mount */
+void resctrl_arch_mount(void);
+
/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 59844fd7105f..a066a9c54a1f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -711,6 +711,15 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
return 0;
}
+void resctrl_arch_mount(void)
+{
+ static bool only_once;
+
+ if (only_once)
+ return;
+ only_once = true;
+}
+
enum {
RDT_FLAG_CMT,
RDT_FLAG_MBM_TOTAL,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index bd41f7a0f416..5ca6de6a6e5c 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2564,6 +2564,8 @@ static int rdt_get_tree(struct fs_context *fc)
struct rdt_resource *r;
int ret;
+ resctrl_arch_mount();
+
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
/*
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread

* Re: [PATCH v3 13/26] fs/resctrl: Add an architectural hook called for each mount
2025-04-07 23:40 ` [PATCH v3 13/26] fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-04-18 23:47 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 23:47 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Enumeration of Intel telemetry events is not complete when the
> resctrl "late_init" code is executed.
>
> Add a hook at the beginning of the mount code that will be used
> to check for telemetry events and initialize them if any are found.
>
> The hook is called on every mount, but the expectation is that
> most actions (such as enumeration) only need to be performed
> on the first call.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 3 +++
> arch/x86/kernel/cpu/resctrl/core.c | 9 +++++++++
> fs/resctrl/rdtgroup.c | 2 ++
> 3 files changed, 14 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8ac77b738de5..25f51a57b0b7 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -421,6 +421,9 @@ void resctrl_offline_cpu(unsigned int cpu);
> int resctrl_set_event_attributes(enum resctrl_event_id evt,
> enum resctrl_event_type type, bool any_cpu);
>
> +/* Architecture hook called for each file system mount */
Please add some description of what an architecture could use it for as well
as more specific detail of when it is called during mount. I think it is
important to highlight and make it part of the agreement that resctrl fs calls
this on mount before any resctrl fs actions. Considering this, perhaps
resctrl_arch_pre_mount()?
It is also worth highlighting in the API doc that fs does not actually
call resctrl_arch_mount() on every mount but on every mount *attempt* (resctrl
may already be mounted), so it is up to the arch to maintain any needed state.
(Also see later comment about locking)
> +void resctrl_arch_mount(void);
> +
> /**
> * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> * for this resource and domain.
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 59844fd7105f..a066a9c54a1f 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -711,6 +711,15 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
> return 0;
> }
>
> +void resctrl_arch_mount(void)
> +{
> + static bool only_once;
> +
> + if (only_once)
> + return;
> + only_once = true;
> +}
> +
> enum {
> RDT_FLAG_CMT,
> RDT_FLAG_MBM_TOTAL,
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index bd41f7a0f416..5ca6de6a6e5c 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -2564,6 +2564,8 @@ static int rdt_get_tree(struct fs_context *fc)
> struct rdt_resource *r;
> int ret;
>
> + resctrl_arch_mount();
> +
> cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
> /*
Could you please elaborate on the locking requirements here? These
expectations are worth a mention in the changelog also. That it is called
without any locks held, and that the arch is responsible for all locking,
should be documented as part of the API in include/linux/resctrl.h.
Reinette
* [PATCH v3 14/26] x86/resctrl: Add first part of telemetry event enumeration
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (12 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 13/26] fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 0:08 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 15/26] x86/resctrl: Second stage " Tony Luck
` (13 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
The OOBMSM driver provides an interface to discover any RMID
based events for "energy" and "perf" classes.
Hold onto references to any pmt_feature_groups that resctrl
uses until resctrl exit.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 8 ++++
arch/x86/kernel/cpu/resctrl/core.c | 5 ++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 62 +++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
4 files changed, 76 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 45eabc7919c6..70b63bbc429d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -172,4 +172,12 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
+#ifdef CONFIG_INTEL_AET_RESCTRL
+bool intel_aet_get_events(void);
+void __exit intel_aet_exit(void);
+#else
+static inline bool intel_aet_get_events(void) { return false; }
+static inline void intel_aet_exit(void) { };
+#endif
+
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a066a9c54a1f..f0f256a5ac66 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -718,6 +718,9 @@ void resctrl_arch_mount(void)
if (only_once)
return;
only_once = true;
+
+ if (!intel_aet_get_events())
+ return;
}
enum {
@@ -1063,6 +1066,8 @@ late_initcall(resctrl_arch_late_init);
static void __exit resctrl_arch_exit(void)
{
+ intel_aet_exit();
+
cpuhp_remove_state(rdt_online);
resctrl_exit();
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
new file mode 100644
index 000000000000..8e531ad279b5
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology(RDT)
+ * - Intel Application Energy Telemetry
+ *
+ * Copyright (C) 2025 Intel Corporation
+ *
+ * Author:
+ * Tony Luck <tony.luck@intel.com>
+ */
+
+#define pr_fmt(fmt) "resctrl: " fmt
+
+#include <linux/cpu.h>
+#include <linux/cleanup.h>
+#include "fake_intel_aet_features.h"
+#include <linux/intel_vsec.h>
+#include <linux/resctrl.h>
+#include <linux/slab.h>
+
+#include "internal.h"
+
+static struct pmt_feature_group *feat_energy;
+static struct pmt_feature_group *feat_perf;
+
+DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
+ if (!IS_ERR_OR_NULL(_T)) \
+ intel_pmt_put_feature_group(_T))
+
+/*
+ * Ask OOBMSM discovery driver for all the RMID based telemetry groups
+ * that it supports.
+ */
+bool intel_aet_get_events(void)
+{
+ struct pmt_feature_group *p1 __free(intel_pmt_put_feature_group) = NULL;
+ struct pmt_feature_group *p2 __free(intel_pmt_put_feature_group) = NULL;
+ bool use_p1, use_p2;
+
+ p1 = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
+ p2 = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
+ use_p1 = !IS_ERR_OR_NULL(p1);
+ use_p2 = !IS_ERR_OR_NULL(p2);
+
+ if (!use_p1 && !use_p2)
+ return false;
+
+ if (use_p1)
+ feat_energy = no_free_ptr(p1);
+ if (use_p2)
+ feat_perf = no_free_ptr(p2);
+
+ return true;
+}
+
+void __exit intel_aet_exit(void)
+{
+ if (feat_energy)
+ intel_pmt_put_feature_group(feat_energy);
+ if (feat_perf)
+ intel_pmt_put_feature_group(feat_perf);
+}
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index c56d3acf8ac7..74c3b2333dde 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_INTEL_AET_RESCTRL) += intel_aet.o
obj-$(CONFIG_INTEL_AET_RESCTRL) += fake_intel_aet_features.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
--
2.48.1
* Re: [PATCH v3 14/26] x86/resctrl: Add first part of telemetry event enumeration
2025-04-07 23:40 ` [PATCH v3 14/26] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
@ 2025-04-19 0:08 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 0:08 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
The following three patches progress as follows:
x86/resctrl: Add first part ...
x86/resctrl: Second stage ...
x86/resctrl: Third phase ...
Could you please make the language consistent?
On 4/7/25 4:40 PM, Tony Luck wrote:
> The OOBMSM driver provides an interface to discover any RMID
> based events for "energy" and "perf" classes.
Please add context about what an RMID based event is.
>
> Hold onto references to any pmt_feature_groups that resctrl
Please add context about what a
"pmt_feature_groups" (intended to be pmt_feature_group?) is.
> uses until resctrl exit.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 8 ++++
> arch/x86/kernel/cpu/resctrl/core.c | 5 ++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 62 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/Makefile | 1 +
> 4 files changed, 76 insertions(+)
> create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 45eabc7919c6..70b63bbc429d 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -172,4 +172,12 @@ void __init intel_rdt_mbm_apply_quirk(void);
>
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>
> +#ifdef CONFIG_INTEL_AET_RESCTRL
> +bool intel_aet_get_events(void);
> +void __exit intel_aet_exit(void);
> +#else
> +static inline bool intel_aet_get_events(void) { return false; }
> +static inline void intel_aet_exit(void) { };
> +#endif
> +
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index a066a9c54a1f..f0f256a5ac66 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -718,6 +718,9 @@ void resctrl_arch_mount(void)
> if (only_once)
> return;
> only_once = true;
> +
> + if (!intel_aet_get_events())
hmmm ... keep in mind that this is called without
any locking and thus there may be a risk of parallel calls.
Please document how/if this relies on some fs feature to
ensure there are not two instances running at the same time.
> + return;
> }
>
> enum {
> @@ -1063,6 +1066,8 @@ late_initcall(resctrl_arch_late_init);
>
> static void __exit resctrl_arch_exit(void)
> {
> + intel_aet_exit();
> +
> cpuhp_remove_state(rdt_online);
>
> resctrl_exit();
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> new file mode 100644
> index 000000000000..8e531ad279b5
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -0,0 +1,62 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Resource Director Technology(RDT)
> + * - Intel Application Energy Telemetry
> + *
> + * Copyright (C) 2025 Intel Corporation
> + *
> + * Author:
> + * Tony Luck <tony.luck@intel.com>
> + */
> +
> +#define pr_fmt(fmt) "resctrl: " fmt
> +
> +#include <linux/cpu.h>
> +#include <linux/cleanup.h>
> +#include "fake_intel_aet_features.h"
This include can be marked as "Temporary" to highlight that it
will not stay.
Please separate headers into blocks and sort alphabetically.
> +#include <linux/intel_vsec.h>
> +#include <linux/resctrl.h>
> +#include <linux/slab.h>
Are all these headers used in code below?
> +
> +#include "internal.h"
> +
> +static struct pmt_feature_group *feat_energy;
> +static struct pmt_feature_group *feat_perf;
> +
> +DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
> + if (!IS_ERR_OR_NULL(_T)) \
> + intel_pmt_put_feature_group(_T))
> +
> +/*
> + * Ask OOBMSM discovery driver for all the RMID based telemetry groups
> + * that it supports.
> + */
> +bool intel_aet_get_events(void)
> +{
> + struct pmt_feature_group *p1 __free(intel_pmt_put_feature_group) = NULL;
> + struct pmt_feature_group *p2 __free(intel_pmt_put_feature_group) = NULL;
> + bool use_p1, use_p2;
> +
> + p1 = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
> + p2 = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
> + use_p1 = !IS_ERR_OR_NULL(p1);
> + use_p2 = !IS_ERR_OR_NULL(p2);
> +
> + if (!use_p1 && !use_p2)
> + return false;
> +
> + if (use_p1)
> + feat_energy = no_free_ptr(p1);
> + if (use_p2)
> + feat_perf = no_free_ptr(p2);
This reminds me of something I read recently .... "There's a rule in computer
programming that objects appear zero, once, or many times. So code accordingly."
Not expecting a change .... just finding it amusing.
> +
> + return true;
> +}
> +
> +void __exit intel_aet_exit(void)
> +{
> + if (feat_energy)
> + intel_pmt_put_feature_group(feat_energy);
> + if (feat_perf)
> + intel_pmt_put_feature_group(feat_perf);
> +}
> diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
> index c56d3acf8ac7..74c3b2333dde 100644
> --- a/arch/x86/kernel/cpu/resctrl/Makefile
> +++ b/arch/x86/kernel/cpu/resctrl/Makefile
> @@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
> obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
> +obj-$(CONFIG_INTEL_AET_RESCTRL) += intel_aet.o
> obj-$(CONFIG_INTEL_AET_RESCTRL) += fake_intel_aet_features.o
> obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
>
Reinette
* [PATCH v3 15/26] x86/resctrl: Second stage of telemetry event enumeration
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (13 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 14/26] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 0:30 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 16/26] x86/resctrl: Third phase " Tony Luck
` (12 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Scan the telemetry_region structures looking for recognised guid
values. Count how many are found in each package.
Note that telemetry support depends on at least one of the
original RDT monitoring features being enabled (so that the
CPU hotplug notifiers for resctrl are running).
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 112 +++++++++++++++++++++++-
1 file changed, 110 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 8e531ad279b5..9d414dd40f8b 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -23,10 +23,100 @@
static struct pmt_feature_group *feat_energy;
static struct pmt_feature_group *feat_perf;
+/* Per-package event groups active on this machine */
+static struct pkg_info {
+ int count;
+ struct telemetry_region *regions;
+} *pkg_info;
+
+/**
+ * struct pmt_event - Telemetry event.
+ * @evtid: Resctrl event id
+ * @evt_offset: MMIO offset of counter
+ * @type: Type for format user display of event value
+ */
+struct pmt_event {
+ enum resctrl_event_id evtid;
+ int evt_offset;
+ enum resctrl_event_type type;
+};
+
+/**
+ * struct telem_entry - Summarized form from XML telemetry description
+ * @name: Name for this group of events
+ * @guid: Unique ID for this group
+ * @size: Size of MMIO mapped counter registers
+ * @num_rmids: Number of RMIDS supported
+ * @overflow_counter_off: Offset of overflow count
+ * @last_overflow_tstamp_off: Offset of overflow timestamp
+ * @last_update_tstamp_off: Offset of last update timestamp
+ * @active: Marks this group as active on this system
+ * @num_events: Size of @evts array
+ * @evts: Telemetry events in this group
+ */
+struct telem_entry {
+ char *name;
+ int guid;
+ int size;
+ int num_rmids;
+ int overflow_counter_off;
+ int last_overflow_tstamp_off;
+ int last_update_tstamp_off;
+ bool active;
+ int num_events;
+ struct pmt_event evts[];
+};
+
+/* All known telemetry event groups */
+static struct telem_entry *telem_entry[] = {
+ NULL
+};
+
+/*
+ * Scan a feature group looking for guids recognized
+ * and update the per-package counts of known groups.
+ */
+static bool count_events(struct pkg_info *pkg, int max_pkgs, struct pmt_feature_group *p)
+{
+ struct telem_entry **tentry;
+ bool found = false;
+
+ if (IS_ERR_OR_NULL(p))
+ return false;
+
+ for (int i = 0; i < p->count; i++) {
+ struct telemetry_region *tr = &p->regions[i];
+
+ for (tentry = telem_entry; *tentry; tentry++) {
+ if (tr->guid == (*tentry)->guid) {
+ if (tr->plat_info.package_id > max_pkgs) {
+ pr_warn_once("Bad package %d\n", tr->plat_info.package_id);
+ continue;
+ }
+ if (tr->size > (*tentry)->size) {
+ pr_warn_once("MMIO region for guid 0x%x too small\n", tr->guid);
+ continue;
+ }
+ found = true;
+ (*tentry)->active = true;
+ pkg[tr->plat_info.package_id].count++;
+ break;
+ }
+ }
+ }
+
+ return found;
+}
+
DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
if (!IS_ERR_OR_NULL(_T)) \
intel_pmt_put_feature_group(_T))
+DEFINE_FREE(free_pkg_info, struct pkg_info *, \
+ if (_T) \
+ for (int i = 0; i < topology_max_packages(); i++) \
+ kfree(_T[i].regions); \
+ kfree(_T))
/*
* Ask OOBMSM discovery driver for all the RMID based telemetry groups
* that it supports.
@@ -35,20 +125,32 @@ bool intel_aet_get_events(void)
{
struct pmt_feature_group *p1 __free(intel_pmt_put_feature_group) = NULL;
struct pmt_feature_group *p2 __free(intel_pmt_put_feature_group) = NULL;
+ struct pkg_info *pkg __free(free_pkg_info) = NULL;
+ int num_pkgs = topology_max_packages();
bool use_p1, use_p2;
+ pkg = kcalloc(num_pkgs, sizeof(*pkg_info), GFP_KERNEL);
+ if (!pkg)
+ return false;
+
p1 = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
p2 = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
- use_p1 = !IS_ERR_OR_NULL(p1);
- use_p2 = !IS_ERR_OR_NULL(p2);
+ use_p1 = count_events(pkg, num_pkgs, p1);
+ use_p2 = count_events(pkg, num_pkgs, p2);
if (!use_p1 && !use_p2)
return false;
+ if (!resctrl_arch_mon_capable()) {
+ pr_info("Telemetry available but monitor support disabled\n");
+ return false;
+ }
+
if (use_p1)
feat_energy = no_free_ptr(p1);
if (use_p2)
feat_perf = no_free_ptr(p2);
+ pkg_info = no_free_ptr(pkg);
return true;
}
@@ -59,4 +161,10 @@ void __exit intel_aet_exit(void)
intel_pmt_put_feature_group(feat_energy);
if (feat_perf)
intel_pmt_put_feature_group(feat_perf);
+
+ if (pkg_info) {
+ for (int i = 0; i < topology_max_packages(); i++)
+ kfree(pkg_info[i].regions);
+ }
+ kfree(pkg_info);
}
--
2.48.1
* Re: [PATCH v3 15/26] x86/resctrl: Second stage of telemetry event enumeration
2025-04-07 23:40 ` [PATCH v3 15/26] x86/resctrl: Second stage " Tony Luck
@ 2025-04-19 0:30 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 0:30 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Scan the telemetry_region structures looking for recognised guid
Please add context before the description of what the patch does.
Also, please pick British or American English and stick with it.
> values. Count how many are found in each package.
>
> Note that telemetry support depends on at least one of the
> original RDT monitoring features being enabled (so that the
> CPU hotplug notifiers for resctrl are running).
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 112 +++++++++++++++++++++++-
> 1 file changed, 110 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 8e531ad279b5..9d414dd40f8b 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -23,10 +23,100 @@
> static struct pmt_feature_group *feat_energy;
> static struct pmt_feature_group *feat_perf;
>
> +/* Per-package event groups active on this machine */
> +static struct pkg_info {
> + int count;
> + struct telemetry_region *regions;
> +} *pkg_info;
> +
> +/**
> + * struct pmt_event - Telemetry event.
Why does it need "pmt" prefix? Can it be "telem_event" to
match telem_entry?
> + * @evtid: Resctrl event id
> + * @evt_offset: MMIO offset of counter
> + * @type: Type for format user display of event value
I cannot make sense of "Type for format user display of event value"
> + */
> +struct pmt_event {
> + enum resctrl_event_id evtid;
> + int evt_offset;
> + enum resctrl_event_type type;
> +};
> +
> +/**
> + * struct telem_entry - Summarized form from XML telemetry description
Copying from v2 review:
"It is not clear to me how useful it is to document that this is
"Summarized form from XML telemetry description". Either more detail should
be added to help reader understand what XML is being talked about or
the description should be a summary of what this data structure represents."
> + * @name: Name for this group of events
> + * @guid: Unique ID for this group
> + * @size: Size of MMIO mapped counter registers
> + * @num_rmids: Number of RMIDS supported
> + * @overflow_counter_off: Offset of overflow count
The description just rewrites the member name and changes "counter" to
"count". Could the description have more details about what is represented
by this? What overflowed?
> + * @last_overflow_tstamp_off: Offset of overflow timestamp
What overflowed at this timestamp?
> + * @last_update_tstamp_off: Offset of last update timestamp
What was updated at this timestamp?
> + * @active: Marks this group as active on this system
What does it mean when a group is "active"?
> + * @num_events: Size of @evts array
Would __counted_by() be useful?
> + * @evts: Telemetry events in this group
> + */
> +struct telem_entry {
> + char *name;
> + int guid;
> + int size;
> + int num_rmids;
> + int overflow_counter_off;
> + int last_overflow_tstamp_off;
> + int last_update_tstamp_off;
Most of the types are "int" ... I do not expect many of these members to be
negative, so I would like to check whether int is the most appropriate type
for all of them. Usually size_t is used for sizes and off_t/loff_t is
available for offsets.
> + bool active;
> + int num_events;
> + struct pmt_event evts[];
(missing tab)
> +};
> +
> +/* All known telemetry event groups */
This is more useful by not being "All known Summarized form from XML telemetry description".
> +static struct telem_entry *telem_entry[] = {
> + NULL
> +};
> +
> +/*
> + * Scan a feature group looking for guids recognized
Switch from British to American English in same patch.
> + * and update the per-package counts of known groups.
> + */
> +static bool count_events(struct pkg_info *pkg, int max_pkgs, struct pmt_feature_group *p)
> +{
> + struct telem_entry **tentry;
> + bool found = false;
> +
> + if (IS_ERR_OR_NULL(p))
> + return false;
> +
> + for (int i = 0; i < p->count; i++) {
> + struct telemetry_region *tr = &p->regions[i];
> +
> + for (tentry = telem_entry; *tentry; tentry++) {
> + if (tr->guid == (*tentry)->guid) {
> + if (tr->plat_info.package_id > max_pkgs) {
Should this be >=?
> + pr_warn_once("Bad package %d\n", tr->plat_info.package_id);
> + continue;
> + }
> + if (tr->size > (*tentry)->size) {
> + pr_warn_once("MMIO region for guid 0x%x too small\n", tr->guid);
> + continue;
> + }
> + found = true;
> + (*tentry)->active = true;
> + pkg[tr->plat_info.package_id].count++;
> + break;
> + }
> + }
> + }
> +
> + return found;
> +}
> +
> DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
> if (!IS_ERR_OR_NULL(_T)) \
> intel_pmt_put_feature_group(_T))
>
> +DEFINE_FREE(free_pkg_info, struct pkg_info *, \
> + if (_T) \
> + for (int i = 0; i < topology_max_packages(); i++) \
> + kfree(_T[i].regions); \
> + kfree(_T))
> /*
> * Ask OOBMSM discovery driver for all the RMID based telemetry groups
> * that it supports.
> @@ -35,20 +125,32 @@ bool intel_aet_get_events(void)
> {
> struct pmt_feature_group *p1 __free(intel_pmt_put_feature_group) = NULL;
> struct pmt_feature_group *p2 __free(intel_pmt_put_feature_group) = NULL;
> + struct pkg_info *pkg __free(free_pkg_info) = NULL;
> + int num_pkgs = topology_max_packages();
> bool use_p1, use_p2;
>
> + pkg = kcalloc(num_pkgs, sizeof(*pkg_info), GFP_KERNEL);
sizeof(*pkg)?
Reinette
* [PATCH v3 16/26] x86/resctrl: Third phase of telemetry event enumeration
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (14 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 15/26] x86/resctrl: Second stage " Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 0:45 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 17/26] x86/resctrl: Build a lookup table for each resctrl event id Tony Luck
` (11 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Allocate per-package arrays for the known telemetry_regions and
initialize with pointers to the structures acquired from the
intel_pmt_get_regions_by_feature() calls.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 38 +++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 9d414dd40f8b..fb03f2e76306 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -108,6 +108,30 @@ static bool count_events(struct pkg_info *pkg, int max_pkgs, struct pmt_feature_
return found;
}
+/*
+ * Copy the pointers to telemetry regions associated with a given package
+ * and with known guids over to the pkg_info structure for that package.
+ */
+static int setup(struct pkg_info *pkg, int pkgnum, struct pmt_feature_group *p, int slot)
+{
+ struct telem_entry **tentry;
+
+ for (int i = 0; i < p->count; i++) {
+ for (tentry = telem_entry; *tentry; tentry++) {
+ if (!(*tentry)->active)
+ continue;
+ if (pkgnum != p->regions[i].plat_info.package_id)
+ continue;
+ if (p->regions[i].guid != (*tentry)->guid)
+ continue;
+
+ pkg[pkgnum].regions[slot++] = p->regions[i];
+ }
+ }
+
+ return slot;
+}
+
DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
if (!IS_ERR_OR_NULL(_T)) \
intel_pmt_put_feature_group(_T))
@@ -128,6 +152,7 @@ bool intel_aet_get_events(void)
struct pkg_info *pkg __free(free_pkg_info) = NULL;
int num_pkgs = topology_max_packages();
bool use_p1, use_p2;
+ int slot;
pkg = kcalloc(num_pkgs, sizeof(*pkg_info), GFP_KERNEL);
if (!pkg)
@@ -146,6 +171,19 @@ bool intel_aet_get_events(void)
return false;
}
+ for (int i = 0; i < num_pkgs; i++) {
+ if (!pkg[i].count)
+ continue;
+ pkg[i].regions = kmalloc_array(pkg[i].count, sizeof(*pkg[i].regions), GFP_KERNEL);
+ if (!pkg[i].regions)
+ return false;
+ slot = 0;
+ if (use_p1)
+ slot = setup(pkg, i, p1, slot);
+ if (use_p2)
+ slot = setup(pkg, i, p2, slot);
+ }
+
if (use_p1)
feat_energy = no_free_ptr(p1);
if (use_p2)
--
2.48.1
* Re: [PATCH v3 16/26] x86/resctrl: Third phase of telemetry event enumeration
2025-04-07 23:40 ` [PATCH v3 16/26] x86/resctrl: Third phase " Tony Luck
@ 2025-04-19 0:45 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 0:45 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Allocate per-package arrays for the known telemetry_regions and
> initialize with pointers to the structures acquired from the
> intel_pmt_get_regions_by_feature() calls.
Why?
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 38 +++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 9d414dd40f8b..fb03f2e76306 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -108,6 +108,30 @@ static bool count_events(struct pkg_info *pkg, int max_pkgs, struct pmt_feature_
> return found;
> }
>
> +/*
> + * Copy the pointers to telemetry regions associated with a given package
> + * and with known guids over to the pkg_info structure for that package.
> + */
> +static int setup(struct pkg_info *pkg, int pkgnum, struct pmt_feature_group *p, int slot)
setup() is very generic. I do not have a suggestion since it is not yet clear
what this does (more below).
> +{
> + struct telem_entry **tentry;
> +
> + for (int i = 0; i < p->count; i++) {
> + for (tentry = telem_entry; *tentry; tentry++) {
> + if (!(*tentry)->active)
> + continue;
> + if (pkgnum != p->regions[i].plat_info.package_id)
> + continue;
> + if (p->regions[i].guid != (*tentry)->guid)
> + continue;
> +
> + pkg[pkgnum].regions[slot++] = p->regions[i];
(please fix spacing)
hmmm .. this looks like one structure copied to another, not the pointer
copy the comment mentions.
> + }
> + }
> +
> + return slot;
> +}
> +
> DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
> if (!IS_ERR_OR_NULL(_T)) \
> intel_pmt_put_feature_group(_T))
> @@ -128,6 +152,7 @@ bool intel_aet_get_events(void)
> struct pkg_info *pkg __free(free_pkg_info) = NULL;
> int num_pkgs = topology_max_packages();
> bool use_p1, use_p2;
> + int slot;
>
> pkg = kcalloc(num_pkgs, sizeof(*pkg_info), GFP_KERNEL);
> if (!pkg)
> @@ -146,6 +171,19 @@ bool intel_aet_get_events(void)
> return false;
> }
>
> + for (int i = 0; i < num_pkgs; i++) {
> + if (!pkg[i].count)
> + continue;
> + pkg[i].regions = kmalloc_array(pkg[i].count, sizeof(*pkg[i].regions), GFP_KERNEL);
As I understand it sizeof(*pkg[i].regions) would be the size of a
struct telemetry_region that the code in setup() initializes by copying
the data from the struct pmt_feature_group.
The changelog and comments create the impression that resctrl's initialization
consists of adding pointers to the data in struct pmt_feature_group, but as
I read it the data is copied instead. Am I reading it wrong?
> + if (!pkg[i].regions)
> + return false;
> + slot = 0;
> + if (use_p1)
> + slot = setup(pkg, i, p1, slot);
> + if (use_p2)
> + slot = setup(pkg, i, p2, slot);
> + }
> +
> if (use_p1)
> feat_energy = no_free_ptr(p1);
> if (use_p2)
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 17/26] x86/resctrl: Build a lookup table for each resctrl event id
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (15 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 16/26] x86/resctrl: Third phase " Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 0:48 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 18/26] x86/resctrl: Add code to read core telemetry events Tony Luck
` (10 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
User requests to read events arrive from the file system layer
with a domain id, RMID, and resctrl event id. Responding to
those requests needs information from various structures.
Build a quick lookup table indexed by resctrl event id.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index fb03f2e76306..44d2fe747ed8 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -67,6 +67,12 @@ struct telem_entry {
struct pmt_event evts[];
};
+/* Lookup table to get from resctrl event id to useful structures */
+static struct evtinfo {
+ struct telem_entry *telem_entry;
+ struct pmt_event *pmt_event;
+} evtinfo[QOS_NUM_EVENTS];
+
/* All known telemetry event groups */
static struct telem_entry *telem_entry[] = {
NULL
@@ -151,6 +157,7 @@ bool intel_aet_get_events(void)
struct pmt_feature_group *p2 __free(intel_pmt_put_feature_group) = NULL;
struct pkg_info *pkg __free(free_pkg_info) = NULL;
int num_pkgs = topology_max_packages();
+ struct telem_entry **tentry;
bool use_p1, use_p2;
int slot;
@@ -184,6 +191,17 @@ bool intel_aet_get_events(void)
slot = setup(pkg, i, p2, slot);
}
+ for (tentry = telem_entry; *tentry; tentry++) {
+ if (!(*tentry)->active)
+ continue;
+ for (int i = 0; i < (*tentry)->num_events; i++) {
+ enum resctrl_event_id evtid = (*tentry)->evts[i].evtid;
+
+ evtinfo[evtid].telem_entry = *tentry;
+ evtinfo[evtid].pmt_event = &(*tentry)->evts[i];
+ }
+ }
+
if (use_p1)
feat_energy = no_free_ptr(p1);
if (use_p2)
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 17/26] x86/resctrl: Build a lookup table for each resctrl event id
2025-04-07 23:40 ` [PATCH v3 17/26] x86/resctrl: Build a lookup table for each resctrl event id Tony Luck
@ 2025-04-19 0:48 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 0:48 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> User requests to read events arrive from the file system layer
> with a domain id, RMID, and resctrl event id. Responding to
> those requests needs information from various structures.
>
> Build a quick lookup table indexed by resctrl event id.
Why is a second lookup table needed? What about the
lookup table ("all_events") that patch #2 introduced?
Perhaps struct mon_evt could get a "void *priv" or related
member that points to the event's private data, which varies by
type of event?
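As a sketch of what that suggestion could look like (a hypothetical, trimmed-down mon_evt; the real structure has more members), the arch code would stash its per-event data behind priv at init time instead of keeping a second evtid-indexed table:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative slices of the real structures. */
struct pmt_event { int evt_offset; };

struct mon_evt {
	int evtid;
	void *priv;	/* arch-private per-event data, set at init time */
};

/* Arch code reaches its pmt_event through priv; no second lookup table. */
static int event_offset(struct mon_evt *evt)
{
	struct pmt_event *pevt = evt->priv;

	return pevt ? pevt->evt_offset : -1;
}

static int demo(void)
{
	static struct pmt_event p = { .evt_offset = 24 };
	struct mon_evt e = { .evtid = 7, .priv = &p };

	return event_offset(&e);
}
```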
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 18/26] x86/resctrl: Add code to read core telemetry events
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (16 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 17/26] x86/resctrl: Build a lookup table for each resctrl event id Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 1:53 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values Tony Luck
` (9 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
The new telemetry events will be part of a new resctrl resource.
Add the RDT_RESOURCE_PERF_PKG to enum resctrl_res_level.
Add hook resctrl_arch_rmid_read() to pass reads on this
resource to the telemetry code.
There may be multiple devices tracking each package, so scan all of them
and add up counters.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl_types.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 5 +++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 58 +++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 6 +++
4 files changed, 70 insertions(+)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index fbd4b55c41aa..3354f21e82ad 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -39,6 +39,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_PERF_PKG,
/* Must be the last */
RDT_NUM_RESOURCES,
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 70b63bbc429d..1b1cbb948a9a 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -175,9 +175,14 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
#ifdef CONFIG_INTEL_AET_RESCTRL
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
+int intel_aet_read_event(int domid, int rmid, int evtid, u64 *val);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void intel_aet_exit(void) { };
+static inline int intel_aet_read_event(int domid, int rmid, int evtid, u64 *val)
+{
+ return -EINVAL;
+}
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 44d2fe747ed8..67a1245858dc 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -73,6 +73,12 @@ static struct evtinfo {
struct pmt_event *pmt_event;
} evtinfo[QOS_NUM_EVENTS];
+#define EVT_NUM_RMIDS(evtid) (evtinfo[evtid].telem_entry->num_rmids)
+#define EVT_NUM_EVENTS(evtid) (evtinfo[evtid].telem_entry->num_events)
+#define EVT_GUID(evtid) (evtinfo[evtid].telem_entry->guid)
+
+#define EVT_OFFSET(evtid) (evtinfo[evtid].pmt_event->evt_offset)
+
/* All known telemetry event groups */
static struct telem_entry *telem_entry[] = {
NULL
@@ -224,3 +230,55 @@ void __exit intel_aet_exit(void)
}
kfree(pkg_info);
}
+
+#define VALID_BIT BIT_ULL(63)
+#define DATA_BITS GENMASK_ULL(62, 0)
+
+/*
+ * Walk the array of telemetry groups on a specific package.
+ * Read and sum values for a specific counter (described by
+ * guid and offset).
+ * Return failure (~0x0ull) if any counter isn't valid.
+ */
+static u64 scan_pmt_devs(int package, int guid, int offset)
+{
+ u64 rval, val;
+ int ndev = 0;
+
+ rval = 0;
+
+ for (int i = 0; i < pkg_info[package].count; i++) {
+ if (pkg_info[package].regions[i].guid != guid)
+ continue;
+ ndev++;
+ val = readq(pkg_info[package].regions[i].addr + offset);
+
+ if (!(val & VALID_BIT))
+ return ~0ull;
+ rval += val & DATA_BITS;
+ }
+
+ return ndev ? rval : ~0ull;
+}
+
+/*
+ * Read counter for an event on a domain (summing all aggregators
+ * on the domain).
+ */
+int intel_aet_read_event(int domid, int rmid, int evtid, u64 *val)
+{
+ u64 evtcount;
+ int offset;
+
+ if (rmid >= EVT_NUM_RMIDS(evtid))
+ return -ENOENT;
+
+ offset = rmid * EVT_NUM_EVENTS(evtid) * sizeof(u64);
+ offset += EVT_OFFSET(evtid);
+ evtcount = scan_pmt_devs(domid, EVT_GUID(evtid), offset);
+
+ if (evtcount != ~0ull || *val == 0)
+ *val += evtcount;
+
+ return evtcount != ~0ull ? 0 : -EINVAL;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 06623d51d006..4fa297d463ba 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -236,6 +236,12 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
u32 prmid;
int ret;
+ if (r->rid == RDT_RESOURCE_PERF_PKG) {
+ ret = intel_aet_read_event(d->hdr.id, rmid, eventid, val);
+
+ return ret ? ret : 0;
+ }
+
resctrl_arch_rmid_read_context_check();
prmid = logical_rmid_to_physical_rmid(cpu, rmid);
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 18/26] x86/resctrl: Add code to read core telemetry events
2025-04-07 23:40 ` [PATCH v3 18/26] x86/resctrl: Add code to read core telemetry events Tony Luck
@ 2025-04-19 1:53 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 1:53 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
(deja vu ... "Add code to" can be dropped)
On 4/7/25 4:40 PM, Tony Luck wrote:
> The new telemetry events will be part of a new resctrl resource.
> Add the RDT_RESOURCE_PERF_PKG to enum resctrl_res_level.
Please follow the tip tree's changelog structure conventions throughout this series.
>
> Add hook resctrl_arch_rmid_read() to pass reads on this
> resource to the telemetry code.
>
> There may be multiple devices tracking each package, so scan all of them
> and add up counters.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl_types.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 5 +++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 58 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 6 +++
> 4 files changed, 70 insertions(+)
>
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index fbd4b55c41aa..3354f21e82ad 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -39,6 +39,7 @@ enum resctrl_res_level {
> RDT_RESOURCE_L2,
> RDT_RESOURCE_MBA,
> RDT_RESOURCE_SMBA,
> + RDT_RESOURCE_PERF_PKG,
>
> /* Must be the last */
> RDT_NUM_RESOURCES,
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 70b63bbc429d..1b1cbb948a9a 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -175,9 +175,14 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
> #ifdef CONFIG_INTEL_AET_RESCTRL
> bool intel_aet_get_events(void);
> void __exit intel_aet_exit(void);
> +int intel_aet_read_event(int domid, int rmid, int evtid, u64 *val);
This can use enum resctrl_event_id for evtid?
> #else
> static inline bool intel_aet_get_events(void) { return false; }
> static inline void intel_aet_exit(void) { };
> +static inline int intel_aet_read_event(int domid, int rmid, int evtid, u64 *val)
> +{
> + return -EINVAL;
> +}
> #endif
>
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 44d2fe747ed8..67a1245858dc 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -73,6 +73,12 @@ static struct evtinfo {
> struct pmt_event *pmt_event;
> } evtinfo[QOS_NUM_EVENTS];
>
> +#define EVT_NUM_RMIDS(evtid) (evtinfo[evtid].telem_entry->num_rmids)
> +#define EVT_NUM_EVENTS(evtid) (evtinfo[evtid].telem_entry->num_events)
> +#define EVT_GUID(evtid) (evtinfo[evtid].telem_entry->guid)
> +
> +#define EVT_OFFSET(evtid) (evtinfo[evtid].pmt_event->evt_offset)
Please open code these or use functions if you need to.
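One possible open-coded shape (a userspace sketch with simplified stand-in structures, not the actual kernel code): typed helper functions instead of the EVT_*() macros, which are easier to grep and get compile-time type checking:

```c
#include <assert.h>

/* Simplified stand-ins for the real lookup structures. */
struct telem_entry { int num_rmids; int num_events; int guid; };
struct pmt_event { int evt_offset; };

#define QOS_NUM_EVENTS 4

static struct evtinfo {
	struct telem_entry *telem_entry;
	struct pmt_event *pmt_event;
} evtinfo[QOS_NUM_EVENTS];

/* Helpers replacing EVT_NUM_RMIDS() and EVT_OFFSET(). */
static int evt_num_rmids(int evtid)
{
	return evtinfo[evtid].telem_entry->num_rmids;
}

static int evt_offset(int evtid)
{
	return evtinfo[evtid].pmt_event->evt_offset;
}

static int demo(void)
{
	static struct telem_entry t = { .num_rmids = 576, .num_events = 7 };
	static struct pmt_event e = { .evt_offset = 8 };

	evtinfo[2].telem_entry = &t;
	evtinfo[2].pmt_event = &e;

	return evt_num_rmids(2) * 1000 + evt_offset(2);
}
```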
> +
> /* All known telemetry event groups */
> static struct telem_entry *telem_entry[] = {
> NULL
> @@ -224,3 +230,55 @@ void __exit intel_aet_exit(void)
> }
> kfree(pkg_info);
> }
> +
> +#define VALID_BIT BIT_ULL(63)
> +#define DATA_BITS GENMASK_ULL(62, 0)
> +
> +/*
> + * Walk the array of telemetry groups on a specific package.
> + * Read and sum values for a specific counter (described by
> + * guid and offset).
> + * Return failure (~0x0ull) if any counter isn't valid.
> + */
> +static u64 scan_pmt_devs(int package, int guid, int offset)
> +{
> + u64 rval, val;
> + int ndev = 0;
> +
> + rval = 0;
This can be done as part of definition.
> +
> + for (int i = 0; i < pkg_info[package].count; i++) {
> + if (pkg_info[package].regions[i].guid != guid)
> + continue;
> + ndev++;
> + val = readq(pkg_info[package].regions[i].addr + offset);
> +
> + if (!(val & VALID_BIT))
> + return ~0ull;
> + rval += val & DATA_BITS;
> + }
> +
> + return ndev ? rval : ~0ull;
> +}
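The VALID_BIT/DATA_BITS masking above can be sanity-checked in isolation; a small userspace sketch (types simplified, values illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define VALID_BIT (1ULL << 63)
#define DATA_BITS (VALID_BIT - 1)	/* same bits as GENMASK_ULL(62, 0) */

/* Extract the 63-bit payload, or report the counter as not valid. */
static int extract(uint64_t raw, uint64_t *data)
{
	if (!(raw & VALID_BIT))
		return -1;
	*data = raw & DATA_BITS;
	return 0;
}

static uint64_t demo_valid(void)
{
	uint64_t d = 0;

	extract(VALID_BIT | 42ULL, &d);	/* valid bit set: payload is 42 */
	return d;
}

static int demo_invalid(void)
{
	uint64_t d = 0;

	return extract(42ULL, &d);	/* valid bit clear: rejected */
}
```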
> +
> +/*
> + * Read counter for an event on a domain (summing all aggregators
> + * on the domain).
> + */
> +int intel_aet_read_event(int domid, int rmid, int evtid, u64 *val)
> +{
> + u64 evtcount;
> + int offset;
> +
> + if (rmid >= EVT_NUM_RMIDS(evtid))
> + return -ENOENT;
> +
> + offset = rmid * EVT_NUM_EVENTS(evtid) * sizeof(u64);
> + offset += EVT_OFFSET(evtid);
> + evtcount = scan_pmt_devs(domid, EVT_GUID(evtid), offset);
> +
> + if (evtcount != ~0ull || *val == 0)
> + *val += evtcount;
> +
> + return evtcount != ~0ull ? 0 : -EINVAL;
> +}
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 06623d51d006..4fa297d463ba 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -236,6 +236,12 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> u32 prmid;
> int ret;
>
> + if (r->rid == RDT_RESOURCE_PERF_PKG) {
> + ret = intel_aet_read_event(d->hdr.id, rmid, eventid, val);
> +
> + return ret ? ret : 0;
> + }
Not sure if I am missing something at this stage, but since
resctrl_arch_rmid_read() can now return -ENOENT, rmid_read::err can
obtain the value -ENOENT, and there may be an
issue when this error is returned since rdtgroup_mondata_show()'s "checkresult"
does not have handling for -ENOENT and will attempt to print data to user space.
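To illustrate the concern, a hedged userspace sketch of the kind of error-to-string mapping the "checkresult" path would need. The error constants are real, but the function name and the strings for the -ENOENT case are hypothetical; without such a case, a garbage counter value would be printed instead:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical mapping from a monitor-read error to the string shown
 * to user space; NULL means "no error, print the counter value". */
static const char *mon_err_str(int err)
{
	switch (err) {
	case 0:
		return NULL;		/* no error: print the value */
	case -EIO:
		return "Error";
	case -EINVAL:
		return "Unavailable";
	case -ENOENT:
		return "Unassigned";	/* RMID beyond the region's counters */
	default:
		return "Unknown error";
	}
}
```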
> +
> resctrl_arch_rmid_read_context_check();
Please keep this context check at the top of the function.
>
> prmid = logical_rmid_to_physical_rmid(cpu, rmid);
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (17 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 18/26] x86/resctrl: Add code to read core telemetry events Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 5:14 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 20/26] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
` (8 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
There are three values of interest:
1) The number of RMIDs supported by the CPU core. This is enumerated by
CPUID leaf 0xF. Linux saves the value in boot_cpu_data.x86_cache_max_rmid.
2) The number of counter registers in each telemetry region. This is
described in the XML file for the region. Linux hard codes it into
the struct telem_entry..num_rmids field.
3) The maximum number of RMIDs that can be tracked simultaneously for
a telemetry region. This is provided in the structures received from
the intel_pmt_get_regions_by_feature() calls.
Print appropriate warnings if these values do not match.
TODO: Need a better UI. The number of implemented counters can be
different per telemetry region.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 31 +++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 67a1245858dc..0bcbac326bee 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -13,6 +13,7 @@
#include <linux/cpu.h>
#include <linux/cleanup.h>
+#include <linux/minmax.h>
#include "fake_intel_aet_features.h"
#include <linux/intel_vsec.h>
#include <linux/resctrl.h>
@@ -51,6 +52,7 @@ struct pmt_event {
* @last_overflow_tstamp_off: Offset of overflow timestamp
* @last_update_tstamp_off: Offset of last update timestamp
* @active: Marks this group as active on this system
+ * @rmid_warned: Set to stop multiple rmid sanity warnings
* @num_events: Size of @evts array
* @evts: Telemetry events in this group
*/
@@ -63,6 +65,7 @@ struct telem_entry {
int last_overflow_tstamp_off;
int last_update_tstamp_off;
bool active;
+ bool rmid_warned;
int num_events;
struct pmt_event evts[];
};
@@ -84,6 +87,33 @@ static struct telem_entry *telem_entry[] = {
NULL
};
+static void rmid_sanity_check(struct telemetry_region *tr, struct telem_entry *tentry)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
+ int system_rmids = boot_cpu_data.x86_cache_max_rmid + 1;
+
+ if (tentry->rmid_warned)
+ return;
+
+ if (tentry->num_rmids != system_rmids) {
+ pr_info("Telemetry region %s has %d RMIDs system supports %d\n",
+ tentry->name, tentry->num_rmids, system_rmids);
+ tentry->rmid_warned = true;
+ }
+
+ if (tr->num_rmids < tentry->num_rmids) {
+ pr_info("Telemetry region %s only supports %d simultaneous RMIDS\n",
+ tentry->name, tr->num_rmids);
+ tentry->rmid_warned = true;
+ }
+
+ /* info/PKG_PERF_MON/num_rmids reports number of guaranteed counters */
+ if (!r->num_rmid)
+ r->num_rmid = tr->num_rmids;
+ else
+ r->num_rmid = min((u32)r->num_rmid, tr->num_rmids);
+}
+
/*
* Scan a feature group looking for guids recognized
* and update the per-package counts of known groups.
@@ -109,6 +139,7 @@ static bool count_events(struct pkg_info *pkg, int max_pkgs, struct pmt_feature_
pr_warn_once("MMIO region for guid 0x%x too small\n", tr->guid);
continue;
}
+ rmid_sanity_check(tr, *tentry);
found = true;
(*tentry)->active = true;
pkg[tr->plat_info.package_id].count++;
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values
2025-04-07 23:40 ` [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values Tony Luck
@ 2025-04-19 5:14 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 5:14 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> There are three values of interest:
> 1) The number of RMIDs supported by the CPU core. This is enumerated by
> CPUID leaf 0xF. Linux saves the value in boot_cpu_data.x86_cache_max_rmid.
> 2) The number of counter registers in each telemetry region. This is
> described in the XML file for the region. Linux hard codes it into
> the struct telem_entry..num_rmids field.
Syntax telem_entry::num_rmids can be used for a member.
> 3) The maximum number of RMIDs that can be tracked simultaneously for
> a telemetry region. This is provided in the structures received from
> the intel_pmt_get_regions_by_feature() calls.
Are (2) and (3) not required to be the same? If not, how does resctrl know
which counter/RMID is being tracked?
>
> Print appropriate warnings if these values do not match.
As mentioned in the cover letter, I do not think that just printing a warning
is sufficient. It really becomes a trial-and-error guessing game for user
space to know which monitor group supports telemetry events.
>
> TODO: Need a better UI. The number of implemented counters can be
> different per telemetry region.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 31 +++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 67a1245858dc..0bcbac326bee 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -13,6 +13,7 @@
>
> #include <linux/cpu.h>
> #include <linux/cleanup.h>
> +#include <linux/minmax.h>
Please sort includes alphabetically.
> #include "fake_intel_aet_features.h"
> #include <linux/intel_vsec.h>
> #include <linux/resctrl.h>
> @@ -51,6 +52,7 @@ struct pmt_event {
> * @last_overflow_tstamp_off: Offset of overflow timestamp
> * @last_update_tstamp_off: Offset of last update timestamp
> * @active: Marks this group as active on this system
> + * @rmid_warned: Set to stop multiple rmid sanity warnings
rmid -> RMID.
I find the description unclear on how to interact with this member. How about
something like:
True if user space has been warned about the number of RMIDs used by
different resources not matching.
> * @num_events: Size of @evts array
> * @evts: Telemetry events in this group
> */
> @@ -63,6 +65,7 @@ struct telem_entry {
> int last_overflow_tstamp_off;
> int last_update_tstamp_off;
> bool active;
> + bool rmid_warned;
> int num_events;
> struct pmt_event evts[];
> };
> @@ -84,6 +87,33 @@ static struct telem_entry *telem_entry[] = {
> NULL
> };
>
> +static void rmid_sanity_check(struct telemetry_region *tr, struct telem_entry *tentry)
> +{
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> + int system_rmids = boot_cpu_data.x86_cache_max_rmid + 1;
It is not clear what "system_rmids" should represent here. Is it, as the changelog states,
the maximum supported by the CPU core, or is it the maximum supported by the L3 resource,
which is the maximum number of monitor groups that can be created?
We see in rdt_get_mon_l3_config() that:
r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
This makes me wonder how this feature behaves on SNC systems?
> +
> + if (tentry->rmid_warned)
> + return;
> +
> + if (tentry->num_rmids != system_rmids) {
> + pr_info("Telemetry region %s has %d RMIDs system supports %d\n",
Is pr_info() intended to be pr_warn()?
The message itself could do with a comma?
> + tentry->name, tentry->num_rmids, system_rmids);
> + tentry->rmid_warned = true;
> + }
Could you please add comments about consequences of when this is encountered?
> +
> + if (tr->num_rmids < tentry->num_rmids) {
> + pr_info("Telemetry region %s only supports %d simultaneous RMIDS\n",
> + tentry->name, tr->num_rmids);
> + tentry->rmid_warned = true;
> + }
I am still trying to get used to all the data structures. From what I can tell, the
offset of a counter is obtained from struct telem_entry. If struct telem_entry thus
thinks there are more RMIDs than what the region supports, would this not cause
memory reads to exceed what the region supports?
Could you please add comments about consequences of when this is encountered?
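A small sketch of the arithmetic behind this question (all values illustrative): the read offset scales with the RMID, so if struct telem_entry claims more RMIDs than the region actually maps, the computed offset walks past the region's MMIO window:

```c
#include <assert.h>

/* Offset computation as in intel_aet_read_event():
 * offset = rmid * num_events * sizeof(u64) + evt_offset */
static unsigned long counter_offset(int rmid, int num_events, int evt_offset)
{
	return (unsigned long)rmid * num_events * 8 + evt_offset;
}

/* Does the 8-byte read stay inside a region that only maps
 * region_rmids rows of counters? */
static int read_in_bounds(int rmid, int num_events, int evt_offset,
			  int region_rmids)
{
	return counter_offset(rmid, num_events, evt_offset) + 8 <=
	       (unsigned long)region_rmids * num_events * 8;
}
```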
> +
> + /* info/PKG_PERF_MON/num_rmids reports number of guaranteed counters */
> + if (!r->num_rmid)
> + r->num_rmid = tr->num_rmids;
> + else
> + r->num_rmid = min((u32)r->num_rmid, tr->num_rmids);
> +}
As I mentioned in response to the previous version, it may be possible to move
resctrl_mon_resource_init() to rdt_get_tree() to be done after these RMID
counts are discovered. When doing so it is possible to size the available
RMIDs used on the system to be supported by all resources.
> +
> /*
> * Scan a feature group looking for guids recognized
> * and update the per-package counts of known groups.
> @@ -109,6 +139,7 @@ static bool count_events(struct pkg_info *pkg, int max_pkgs, struct pmt_feature_
> pr_warn_once("MMIO region for guid 0x%x too small\n", tr->guid);
> continue;
> }
> + rmid_sanity_check(tr, *tentry);
> found = true;
> (*tentry)->active = true;
> pkg[tr->plat_info.package_id].count++;
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 20/26] x86/resctrl: Add and initialize rdt_resource for package scope core monitor
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (18 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-07 23:40 ` [PATCH v3 21/26] fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete Tony Luck
` (7 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Counts for each Intel telemetry event are periodically sent to one or
more aggregators on each package where accumulated totals are made
available in MMIO registers.
Add a new resource for monitoring these events with code to build
domains at the package granularity.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 11 +++++++++++
2 files changed, 12 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 25f51a57b0b7..c03e7dc1f009 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -232,6 +232,7 @@ enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
RESCTRL_L3_NODE,
+ RESCTRL_PACKAGE,
};
/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f0f256a5ac66..9578d9c7260c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -99,6 +99,15 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
.schema_fmt = RESCTRL_SCHEMA_RANGE,
},
},
+ [RDT_RESOURCE_PERF_PKG] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .name = "PERF_PKG",
+ .mon_scope = RESCTRL_PACKAGE,
+ .mon_domains = mon_domain_init(RDT_RESOURCE_PERF_PKG),
+ },
+ },
};
u32 resctrl_arch_system_num_rmid_idx(void)
@@ -431,6 +440,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return get_cpu_cacheinfo_id(cpu, scope);
case RESCTRL_L3_NODE:
return cpu_to_node(cpu);
+ case RESCTRL_PACKAGE:
+ return topology_physical_package_id(cpu);
default:
break;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v3 21/26] fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (19 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 20/26] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 5:22 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 22/26] fs/resctrl: Add type define for PERF_PKG files Tony Luck
` (6 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Add a new rdt_perf_pkg_mon_domain structure. This only consists of
the common rdt_domain_hdr as there is no need for any per-domain
data structures.
Use as much as possible of the existing domain setup and tear down
infrastructure. In many cases the RDT_RESOURCE_PERF_PKG uses the
same functions but just skips over the pieces it does not need.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 8 ++++++++
arch/x86/kernel/cpu/resctrl/core.c | 32 ++++++++++++++++++++++++++++++
fs/resctrl/rdtgroup.c | 11 ++++++++--
3 files changed, 49 insertions(+), 2 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c03e7dc1f009..6f598a64b192 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -169,6 +169,14 @@ struct rdt_mon_domain {
int cqm_work_cpu;
};
+/**
+ * struct rdt_perf_pkg_mon_domain - CPUs sharing an Intel-PMT-scoped resctrl monitor resource
+ * @hdr: common header for different domain types
+ */
+struct rdt_perf_pkg_mon_domain {
+ struct rdt_domain_hdr hdr;
+};
+
/**
* struct resctrl_cache - Cache allocation related data
* @cbm_len: Length of the cache bit mask
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9578d9c7260c..6f5d52a8219b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -542,6 +542,29 @@ static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct
}
}
+static void setup_intel_aet_mon_domain(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos)
+{
+ struct rdt_perf_pkg_mon_domain *d;
+ int err;
+
+ d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
+ if (!d)
+ return;
+
+ d->hdr.id = id;
+ d->hdr.type = DOMTYPE(r->rid, DOMTYPE_MON);
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_mon_domain(r, &d->hdr);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ kfree(d);
+ }
+}
+
static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
@@ -571,6 +594,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
case RDT_RESOURCE_L3:
setup_l3_mon_domain(cpu, id, r, add_pos);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ setup_intel_aet_mon_domain(cpu, id, r, add_pos);
+ break;
default:
WARN_ON_ONCE(1);
}
@@ -668,6 +694,12 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
synchronize_rcu();
mon_domain_free(hw_dom);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ resctrl_offline_mon_domain(r, d);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
+ kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr));
+ break;
}
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 5ca6de6a6e5c..34fcd20f8dd7 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4020,6 +4020,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
if (resctrl_mounted && resctrl_arch_mon_capable())
rmdir_mondata_subdir_allrdtgrp(r, &d->hdr);
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ goto done;
+
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) {
@@ -4036,7 +4039,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
}
domain_destroy_mon_state(d);
-
+done:
mutex_unlock(&rdtgroup_mutex);
}
@@ -4104,12 +4107,15 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
struct rdt_mon_domain *d;
- int err;
+ int err = 0;
WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON));
d = container_of(hdr, struct rdt_mon_domain, hdr);
mutex_lock(&rdtgroup_mutex);
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ goto do_mkdir;
+
err = domain_setup_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4123,6 +4129,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (resctrl_arch_is_llc_occupancy_enabled())
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+do_mkdir:
/*
* If the filesystem is not mounted then only the default resource group
* exists. Creation of its directories is deferred until mount time
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v3 21/26] fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete
2025-04-07 23:40 ` [PATCH v3 21/26] fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete Tony Luck
@ 2025-04-19 5:22 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 5:22 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Add a new rdt_perf_pkg_mon_domain structure. This only consists of
> the common rdt_domain_hdr as there is no need for any per-domain
> data structures.
>
> Use as much as possible of the existing domain setup and tear down
> infrastructure. In many cases the RDT_RESOURCE_PERF_PKG uses the
> same functions but just skips over the pieces it does not need.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 8 ++++++++
> arch/x86/kernel/cpu/resctrl/core.c | 32 ++++++++++++++++++++++++++++++
> fs/resctrl/rdtgroup.c | 11 ++++++++--
> 3 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index c03e7dc1f009..6f598a64b192 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -169,6 +169,14 @@ struct rdt_mon_domain {
> int cqm_work_cpu;
> };
>
> +/**
> + * struct rdt_perf_pkg_mon_domain - CPUs sharing an Intel-PMT-scoped resctrl monitor resource
It is a red flag when architecture specific things ("rdt" and "Intel-PMT-scoped") land
in include/linux/resctrl.h
> + * @hdr: common header for different domain types
> + */
> +struct rdt_perf_pkg_mon_domain {
> + struct rdt_domain_hdr hdr;
> +};
> +
> /**
> * struct resctrl_cache - Cache allocation related data
> * @cbm_len: Length of the cache bit mask
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 9578d9c7260c..6f5d52a8219b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -542,6 +542,29 @@ static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct
> }
> }
>
> +static void setup_intel_aet_mon_domain(int cpu, int id, struct rdt_resource *r,
> + struct list_head *add_pos)
> +{
> + struct rdt_perf_pkg_mon_domain *d;
> + int err;
> +
> + d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
> + if (!d)
> + return;
> +
> + d->hdr.id = id;
> + d->hdr.type = DOMTYPE(r->rid, DOMTYPE_MON);
> + cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + list_add_tail_rcu(&d->hdr.list, add_pos);
> +
> + err = resctrl_online_mon_domain(r, &d->hdr);
> + if (err) {
> + list_del_rcu(&d->hdr.list);
> + synchronize_rcu();
> + kfree(d);
> + }
> +}
> +
> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> @@ -571,6 +594,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> case RDT_RESOURCE_L3:
> setup_l3_mon_domain(cpu, id, r, add_pos);
> break;
> + case RDT_RESOURCE_PERF_PKG:
> + setup_intel_aet_mon_domain(cpu, id, r, add_pos);
> + break;
> default:
> WARN_ON_ONCE(1);
> }
> @@ -668,6 +694,12 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> synchronize_rcu();
> mon_domain_free(hw_dom);
> break;
> + case RDT_RESOURCE_PERF_PKG:
> + resctrl_offline_mon_domain(r, d);
Something should have complained about using an uninitialized variable here (d).
> + list_del_rcu(&hdr->list);
> + synchronize_rcu();
> + kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr));
> + break;
> }
> }
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 5ca6de6a6e5c..34fcd20f8dd7 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -4020,6 +4020,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> if (resctrl_mounted && resctrl_arch_mon_capable())
> rmdir_mondata_subdir_allrdtgrp(r, &d->hdr);
>
> + if (r->rid == RDT_RESOURCE_PERF_PKG)
> + goto done;
Could you please change this test to
if (r->rid != RDT_RESOURCE_L3)
This makes it clear what the code that follows supports, rather than singling out one resource that it does
not support (which could cause issues in the future when another monitoring resource is added).
> +
> if (resctrl_is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) {
> @@ -4036,7 +4039,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> }
>
> domain_destroy_mon_state(d);
> -
> +done:
> mutex_unlock(&rdtgroup_mutex);
> }
>
> @@ -4104,12 +4107,15 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
> int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> struct rdt_mon_domain *d;
> - int err;
> + int err = 0;
>
> WARN_ON_ONCE(hdr->type != DOMTYPE(r->rid, DOMTYPE_MON));
> d = container_of(hdr, struct rdt_mon_domain, hdr);
> mutex_lock(&rdtgroup_mutex);
>
> + if (r->rid == RDT_RESOURCE_PERF_PKG)
> + goto do_mkdir;
same
> +
> err = domain_setup_mon_state(r, d);
> if (err)
> goto out_unlock;
> @@ -4123,6 +4129,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
> if (resctrl_arch_is_llc_occupancy_enabled())
> INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
>
> +do_mkdir:
> /*
> * If the filesystem is not mounted then only the default resource group
> * exists. Creation of its directories is deferred until mount time
Reinette
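[Editorial note: the kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr)) call in the diff above relies on the common-header embedding idiom. A minimal standalone sketch of that idiom, with simplified stand-in type names rather than the kernel's, assuming only a first-member embedded header:]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Standalone sketch (not kernel code): a domain type that consists only
 * of a common header, and a container_of() macro that recovers the
 * outer structure from a pointer to the embedded header.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct domain_hdr {
	int id;
};

struct perf_pkg_mon_domain {
	struct domain_hdr hdr;
};

/* Return 1 if recovering the outer struct from &d.hdr yields &d again. */
static int hdr_roundtrip_ok(void)
{
	struct perf_pkg_mon_domain d = { .hdr = { .id = 3 } };
	struct domain_hdr *hdr = &d.hdr;
	struct perf_pkg_mon_domain *back =
		container_of(hdr, struct perf_pkg_mon_domain, hdr);

	return back == &d && back->hdr.id == 3;
}
```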
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 22/26] fs/resctrl: Add type define for PERF_PKG files
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (20 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 21/26] fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-07 23:40 ` [PATCH v3 23/26] fs/resctrl: Add new telemetry event id and structures Tony Luck
` (5 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Creation of the default info file for monitor resources requires
an RFTYPE_RES_ define and a mapping from the resource id.
Add the define and case in fflags_from_resource().
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 2 ++
fs/resctrl/rdtgroup.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 4a840e683e96..b7bc820da726 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -253,6 +253,8 @@ struct rdtgroup {
#define RFTYPE_DEBUG BIT(10)
+#define RFTYPE_RES_PERF_PKG BIT(11)
+
#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 34fcd20f8dd7..cae68e8b9f86 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2175,6 +2175,8 @@ static unsigned long fflags_from_resource(struct rdt_resource *r)
case RDT_RESOURCE_MBA:
case RDT_RESOURCE_SMBA:
return RFTYPE_RES_MB;
+ case RDT_RESOURCE_PERF_PKG:
+ return RFTYPE_RES_PERF_PKG;
}
return WARN_ON_ONCE(1);
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v3 23/26] fs/resctrl: Add new telemetry event id and structures
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (21 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 22/26] fs/resctrl: Add type define for PERF_PKG files Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-07 23:40 ` [PATCH v3 24/26] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG Tony Luck
` (4 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Add nine new events for per-RMID energy and perf monitoring.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl_types.h | 19 ++++++++++++---
fs/resctrl/monitor.c | 45 +++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+), 3 deletions(-)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index 3354f21e82ad..2c959e7233dd 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -46,16 +46,29 @@ enum resctrl_res_level {
};
/*
- * Event IDs, the values match those used to program IA32_QM_EVTSEL before
- * reading IA32_QM_CTR on RDT systems.
+ * Event IDs
*/
enum resctrl_event_id {
+ /* Legacy events. Values must match X86 IA32_QM_EVTSEL usage */
QOS_L3_OCCUP_EVENT_ID = 0x01,
QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
+
+ /* Intel Telemetry Events */
+ PMT_EVENT_ENERGY,
+ PMT_EVENT_ACTIVITY,
+ PMT_EVENT_STALLS_LLC_HIT,
+ PMT_EVENT_C1_RES,
+ PMT_EVENT_UNHALTED_CORE_CYCLES,
+ PMT_EVENT_STALLS_LLC_MISS,
+ PMT_EVENT_AUTO_C6_RES,
+ PMT_EVENT_UNHALTED_REF_CYCLES,
+ PMT_EVENT_UOPS_RETIRED,
+
+ /* Must be the last */
+ QOS_NUM_EVENTS
};
-#define QOS_NUM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID + 1)
#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 5846a13c631a..0207c9ed2d47 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -861,6 +861,51 @@ static struct mon_evt all_events[QOS_NUM_EVENTS] = {
.rid = RDT_RESOURCE_L3,
.type = EVT_TYPE_U64,
},
+ [PMT_EVENT_ENERGY] = {
+ .name = "core_energy",
+ .evtid = PMT_EVENT_ENERGY,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_ACTIVITY] = {
+ .name = "activity",
+ .evtid = PMT_EVENT_ACTIVITY,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_STALLS_LLC_HIT] = {
+ .name = "stalls_llc_hit",
+ .evtid = PMT_EVENT_STALLS_LLC_HIT,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_C1_RES] = {
+ .name = "c1_res",
+ .evtid = PMT_EVENT_C1_RES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_UNHALTED_CORE_CYCLES] = {
+ .name = "unhalted_core_cycles",
+ .evtid = PMT_EVENT_UNHALTED_CORE_CYCLES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_STALLS_LLC_MISS] = {
+ .name = "stalls_llc_miss",
+ .evtid = PMT_EVENT_STALLS_LLC_MISS,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_AUTO_C6_RES] = {
+ .name = "c6_res",
+ .evtid = PMT_EVENT_AUTO_C6_RES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_UNHALTED_REF_CYCLES] = {
+ .name = "unhalted_ref_cycles",
+ .evtid = PMT_EVENT_UNHALTED_REF_CYCLES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_UOPS_RETIRED] = {
+ .name = "uops_retired",
+ .evtid = PMT_EVENT_UOPS_RETIRED,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
};
int resctrl_set_event_attributes(enum resctrl_event_id evt,
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v3 24/26] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (22 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 23/26] fs/resctrl: Add new telemetry event id and structures Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-07 23:40 ` [PATCH v3 25/26] fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events Tony Luck
` (3 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
For each of the discovered telemetry events:
Mark as enabled in the rdt_mon_features bitmap.
Set the value display type.
Mark that the event can be read from any CPU.
Because the resource was not marked as enabled during early
initialization, no domain discovery or allocation was done.
Do that in the architecture's first-mount hook.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 14 ++++++++++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 5 +++++
2 files changed, 19 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6f5d52a8219b..83da63b24f45 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -756,7 +756,9 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
void resctrl_arch_mount(void)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
static bool only_once;
+ int cpu;
if (only_once)
return;
@@ -764,6 +766,18 @@ void resctrl_arch_mount(void)
if (!intel_aet_get_events())
return;
+
+ /*
+ * Late discovery of telemetry events means the domains for the
+ * resource were not built. Do that now.
+ */
+ cpus_read_lock();
+ mutex_lock(&domain_list_lock);
+ r->mon_capable = true;
+ for_each_online_cpu(cpu)
+ domain_add_cpu_mon(cpu, r);
+ mutex_unlock(&domain_list_lock);
+ cpus_read_unlock();
}
enum {
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 0bcbac326bee..529f6d49e3a3 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -233,9 +233,14 @@ bool intel_aet_get_events(void)
continue;
for (int i = 0; i < (*tentry)->num_events; i++) {
enum resctrl_event_id evtid = (*tentry)->evts[i].evtid;
+ enum resctrl_event_type type;
evtinfo[evtid].telem_entry = *tentry;
evtinfo[evtid].pmt_event = &(*tentry)->evts[i];
+
+ __set_bit(evtid, rdt_mon_features);
+ type = (*tentry)->evts[i].type;
+ resctrl_set_event_attributes(evtid, type, true);
}
}
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v3 25/26] fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (23 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 24/26] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 5:30 ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 26/26] x86/resctrl: Update Documentation for package events Tony Luck
` (2 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
There are two event groups: one for energy reporting and another
for "perf" events.
See the XML description files in https://github.com/intel/Intel-PMT
under xml/CWF/OOBMSM/{RMID-ENERGY,RMID-PERF}/ for the detailed
descriptions from which these summaries were derived.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 57 +++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 529f6d49e3a3..e1097767009e 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -42,6 +42,8 @@ struct pmt_event {
enum resctrl_event_type type;
};
+#define EVT(id, offset, _type) { .evtid = id, .evt_offset = offset, .type = _type }
+
/**
* struct telem_entry - Summarized form from XML telemetry description
* @name: Name for this group of events
@@ -82,8 +84,63 @@ static struct evtinfo {
#define EVT_OFFSET(evtid) (evtinfo[evtid].pmt_event->evt_offset)
+/*
+ * https://github.com/intel/Intel-PMT
+ * xml/CWF/OOBMSM/RMID-ENERGY *.xml
+ */
+#define NUM_RMIDS_0x26696143 576
+#define GUID_0x26696143 0x26696143
+#define NUM_EVENTS_0x26696143 2
+#define EVT_BYTES_0x26696143 (NUM_RMIDS_0x26696143 * NUM_EVENTS_0x26696143 * sizeof(u64))
+
+static struct telem_entry energy_0x26696143 = {
+ .name = "energy",
+ .guid = GUID_0x26696143,
+ .size = EVT_BYTES_0x26696143 + sizeof(u64) * 3,
+ .num_rmids = NUM_RMIDS_0x26696143,
+ .overflow_counter_off = EVT_BYTES_0x26696143 + sizeof(u64) * 0,
+ .last_overflow_tstamp_off = EVT_BYTES_0x26696143 + sizeof(u64) * 1,
+ .last_update_tstamp_off = EVT_BYTES_0x26696143 + sizeof(u64) * 2,
+ .num_events = NUM_EVENTS_0x26696143,
+ .evts = {
+ EVT(PMT_EVENT_ENERGY, 0x0, EVT_TYPE_U46_18),
+ EVT(PMT_EVENT_ACTIVITY, 0x8, EVT_TYPE_U46_18),
+ }
+};
+
+/*
+ * https://github.com/intel/Intel-PMT
+ * xml/CWF/OOBMSM/RMID-PERF *.xml
+ */
+#define NUM_RMIDS_0x26557651 576
+#define GUID_0x26557651 0x26557651
+#define NUM_EVENTS_0x26557651 7
+#define EVT_BYTES_0x26557651 (NUM_RMIDS_0x26557651 * NUM_EVENTS_0x26557651 * sizeof(u64))
+
+static struct telem_entry perf_0x26557651 = {
+ .name = "perf",
+ .guid = GUID_0x26557651,
+ .size = EVT_BYTES_0x26557651 + sizeof(u64) * 3,
+ .num_rmids = NUM_RMIDS_0x26557651,
+ .overflow_counter_off = EVT_BYTES_0x26557651 + sizeof(u64) * 0,
+ .last_overflow_tstamp_off = EVT_BYTES_0x26557651 + sizeof(u64) * 1,
+ .last_update_tstamp_off = EVT_BYTES_0x26557651 + sizeof(u64) * 2,
+ .num_events = NUM_EVENTS_0x26557651,
+ .evts = {
+ EVT(PMT_EVENT_STALLS_LLC_HIT, 0x0, EVT_TYPE_U64),
+ EVT(PMT_EVENT_C1_RES, 0x8, EVT_TYPE_U64),
+ EVT(PMT_EVENT_UNHALTED_CORE_CYCLES, 0x10, EVT_TYPE_U64),
+ EVT(PMT_EVENT_STALLS_LLC_MISS, 0x18, EVT_TYPE_U64),
+ EVT(PMT_EVENT_AUTO_C6_RES, 0x20, EVT_TYPE_U64),
+ EVT(PMT_EVENT_UNHALTED_REF_CYCLES, 0x28, EVT_TYPE_U64),
+ EVT(PMT_EVENT_UOPS_RETIRED, 0x30, EVT_TYPE_U64),
+ }
+};
+
/* All known telemetry event groups */
static struct telem_entry *telem_entry[] = {
+ &energy_0x26696143,
+ &perf_0x26557651,
NULL
};
--
2.48.1
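[Editorial note: the .size and *_off fields in the structures above follow one pattern: num_rmids * num_events u64 counters, followed by an overflow counter and two timestamps (three u64 status fields). A standalone cross-check of that arithmetic; region_size() is an illustrative helper, not a kernel function:]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the telem_entry size arithmetic: the per-group MMIO region
 * is num_rmids * num_events 64-bit counters, then three u64 status
 * fields (overflow counter, last-overflow timestamp, last-update
 * timestamp).
 */
static uint64_t region_size(uint32_t num_rmids, uint32_t num_events)
{
	uint64_t evt_bytes =
		(uint64_t)num_rmids * num_events * sizeof(uint64_t);

	return evt_bytes + 3 * sizeof(uint64_t);
}
```

With the values from the patch this gives 9240 bytes for the 2-event energy group and 32280 bytes for the 7-event perf group, each for 576 RMIDs.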
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 25/26] fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events
2025-04-07 23:40 ` [PATCH v3 25/26] fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events Tony Luck
@ 2025-04-19 5:30 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 5:30 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> There are two event groups one for energy reporting and another
> for "perf" events.
Please add context.
>
> See the XML description files in https://github.com/intel/Intel-PMT
> in the xml/CWF/OOBMSM/{RMID-ENERGY,RMID-PERF}/ for the detailed
> descriptions that were used to derive these descriptions.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
The url in text can be a "Link:" here.
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 57 +++++++++++++++++++++++++
> 1 file changed, 57 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 529f6d49e3a3..e1097767009e 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -42,6 +42,8 @@ struct pmt_event {
> enum resctrl_event_type type;
> };
>
> +#define EVT(id, offset, _type) { .evtid = id, .evt_offset = offset, .type = _type }
> +
> /**
> * struct telem_entry - Summarized form from XML telemetry description
> * @name: Name for this group of events
> @@ -82,8 +84,63 @@ static struct evtinfo {
>
> #define EVT_OFFSET(evtid) (evtinfo[evtid].pmt_event->evt_offset)
>
> +/*
> + * https://github.com/intel/Intel-PMT
> + * xml/CWF/OOBMSM/RMID-ENERGY *.xml
This can be one line.
> + */
> +#define NUM_RMIDS_0x26696143 576
> +#define GUID_0x26696143 0x26696143
> +#define NUM_EVENTS_0x26696143 2
> +#define EVT_BYTES_0x26696143 (NUM_RMIDS_0x26696143 * NUM_EVENTS_0x26696143 * sizeof(u64))
> +
> +static struct telem_entry energy_0x26696143 = {
> + .name = "energy",
> + .guid = GUID_0x26696143,
> + .size = EVT_BYTES_0x26696143 + sizeof(u64) * 3,
> + .num_rmids = NUM_RMIDS_0x26696143,
> + .overflow_counter_off = EVT_BYTES_0x26696143 + sizeof(u64) * 0,
> + .last_overflow_tstamp_off = EVT_BYTES_0x26696143 + sizeof(u64) * 1,
> + .last_update_tstamp_off = EVT_BYTES_0x26696143 + sizeof(u64) * 2,
> + .num_events = NUM_EVENTS_0x26696143,
> + .evts = {
> + EVT(PMT_EVENT_ENERGY, 0x0, EVT_TYPE_U46_18),
> + EVT(PMT_EVENT_ACTIVITY, 0x8, EVT_TYPE_U46_18),
> + }
> +};
> +
> +/*
> + * https://github.com/intel/Intel-PMT
> + * xml/CWF/OOBMSM/RMID-PERF *.xml
This can be one line.
> + */
> +#define NUM_RMIDS_0x26557651 576
> +#define GUID_0x26557651 0x26557651
> +#define NUM_EVENTS_0x26557651 7
> +#define EVT_BYTES_0x26557651 (NUM_RMIDS_0x26557651 * NUM_EVENTS_0x26557651 * sizeof(u64))
> +
> +static struct telem_entry perf_0x26557651 = {
> + .name = "perf",
> + .guid = GUID_0x26557651,
> + .size = EVT_BYTES_0x26557651 + sizeof(u64) * 3,
> + .num_rmids = NUM_RMIDS_0x26557651,
> + .overflow_counter_off = EVT_BYTES_0x26557651 + sizeof(u64) * 0,
> + .last_overflow_tstamp_off = EVT_BYTES_0x26557651 + sizeof(u64) * 1,
> + .last_update_tstamp_off = EVT_BYTES_0x26557651 + sizeof(u64) * 2,
> + .num_events = NUM_EVENTS_0x26557651,
> + .evts = {
> + EVT(PMT_EVENT_STALLS_LLC_HIT, 0x0, EVT_TYPE_U64),
> + EVT(PMT_EVENT_C1_RES, 0x8, EVT_TYPE_U64),
> + EVT(PMT_EVENT_UNHALTED_CORE_CYCLES, 0x10, EVT_TYPE_U64),
> + EVT(PMT_EVENT_STALLS_LLC_MISS, 0x18, EVT_TYPE_U64),
> + EVT(PMT_EVENT_AUTO_C6_RES, 0x20, EVT_TYPE_U64),
> + EVT(PMT_EVENT_UNHALTED_REF_CYCLES, 0x28, EVT_TYPE_U64),
> + EVT(PMT_EVENT_UOPS_RETIRED, 0x30, EVT_TYPE_U64),
> + }
> +};
> +
> /* All known telemetry event groups */
> static struct telem_entry *telem_entry[] = {
> + &energy_0x26696143,
> + &perf_0x26557651,
> NULL
Looks like a change from the previous design to use telem_entry::num_events instead
of a NULL terminator. The NULL entry can thus be removed?
> };
>
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 26/26] x86/resctrl: Update Documentation for package events
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (24 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 25/26] fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events Tony Luck
@ 2025-04-07 23:40 ` Tony Luck
2025-04-19 5:40 ` Reinette Chatre
2025-04-18 21:13 ` [PATCH v3 00/26] x86/resctrl telemetry monitoring Reinette Chatre
2025-04-19 5:47 ` Reinette Chatre
27 siblings, 1 reply; 67+ messages in thread
From: Tony Luck @ 2025-04-07 23:40 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches, Tony Luck
Each "mon_data" directory is now divided between L3 events and package
events.
The "info/PERF_PKG_MON" directory contains parameters for perf events.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Documentation/filesystems/resctrl.rst | 38 ++++++++++++++++++++-------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index 6768fc1fad16..b89a188b0321 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -167,7 +167,7 @@ with respect to allocation:
bandwidth percentages are directly applied to
the threads running on the core
-If RDT monitoring is available there will be an "L3_MON" directory
+If RDT L3 monitoring is available there will be an "L3_MON" directory
with the following files:
"num_rmids":
@@ -261,6 +261,17 @@ with the following files:
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
+If RDT PERF monitoring is available there will be an "L3_PERF_PKG" directory
+with the following files:
+
+"num_rmids":
+ The guaranteed number of hardware countes supporting RMIDs.
+ If more "CTRL_MON" + "MON" groups than this number are created,
+ the system may report that counters are "unavailable" when read.
+
+"mon_features":
+ Lists the perf monitoring events that are enabled on this system.
+
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
@@ -366,15 +377,22 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:
"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
- "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
- files provide a read out of the current value of the event for
- all tasks in the group. In CTRL_MON groups these files provide
- the sum for all tasks in the CTRL_MON group and all tasks in
- MON groups. Please see example section for more details on usage.
+ This contains a set of directories, one for each instance
+ of an L3 cache, or of a processor package. The L3 cache
+ directories are named "mon_L3_00", "mon_L3_01" etc. The
> + package directories are named "mon_PERF_PKG_00", "mon_PERF_PKG_01" etc.
+
+ Within each directory there is one file per event. In
+ the L3 directories: "llc_occupancy", "mbm_total_bytes",
+ and "mbm_local_bytes". In the PERF_PKG directories: "core_energy",
+ "activity", etc.
+
+ In a MON group these files provide a read out of the current
+ value of the event for all tasks in the group. In CTRL_MON groups
+ these files provide the sum for all tasks in the CTRL_MON group
+ and all tasks in MON groups. Please see example section for more
+ details on usage.
+
On systems with Sub-NUMA Cluster (SNC) enabled there are extra
directories for each node (located within the "mon_L3_XX" directory
for the L3 cache they occupy). These are named "mon_sub_L3_YY"
--
2.48.1
^ permalink raw reply related [flat|nested] 67+ messages in thread* Re: [PATCH v3 26/26] x86/resctrl: Update Documentation for package events
2025-04-07 23:40 ` [PATCH v3 26/26] x86/resctrl: Update Documentation for package events Tony Luck
@ 2025-04-19 5:40 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 5:40 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Each "mon_data" directory is now divided between L3 events and package
> events.
>
> The "info/PERF_PKG_MON" directory contains parameters for perf events.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Documentation/filesystems/resctrl.rst | 38 ++++++++++++++++++++-------
> 1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index 6768fc1fad16..b89a188b0321 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -167,7 +167,7 @@ with respect to allocation:
> bandwidth percentages are directly applied to
> the threads running on the core
>
> -If RDT monitoring is available there will be an "L3_MON" directory
> +If RDT L3 monitoring is available there will be an "L3_MON" directory
I think "RDT" can just be dropped.
> with the following files:
>
> "num_rmids":
> @@ -261,6 +261,17 @@ with the following files:
> bytes) at which a previously used LLC_occupancy
> counter can be considered for re-use.
>
> +If RDT PERF monitoring is available there will be an "L3_PERF_PKG" directory
"L3_PERF_PKG" -> "PERF_PKG_MON" ?
I understand that the existing L3 documentation contains this term but I do not
see a reason why the documentation should make this new monitoring Intel/RDT specific.
Also, I do not think a user can be expected to know what "perf monitoring" is.
> +with the following files:
> +
> +"num_rmids":
> + The guaranteed number of hardware countes supporting RMIDs.
countes -> counters?
The use of "hardware counters" is a bit unexpected ... the series did not mention
this or I must have missed this.
> + If more "CTRL_MON" + "MON" groups than this number are created,
> + the system may report that counters are "unavailable" when read.
To be precise it is "Unavailable" ... but I do not think that is a good interface.
> +
> +"mon_features":
> + Lists the perf monitoring events that are enabled on this system.
"PERF" (all caps) at top and "perf" lower case here?
> +
> Finally, in the top level of the "info" directory there is a file
> named "last_cmd_status". This is reset with every "command" issued
> via the file system (making new directories or writing to any of the
> @@ -366,15 +377,22 @@ When control is enabled all CTRL_MON groups will also contain:
> When monitoring is enabled all MON groups will also contain:
>
> "mon_data":
> - This contains a set of files organized by L3 domain and by
> - RDT event. E.g. on a system with two L3 domains there will
> - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
> - directories have one file per event (e.g. "llc_occupancy",
> - "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> - files provide a read out of the current value of the event for
> - all tasks in the group. In CTRL_MON groups these files provide
> - the sum for all tasks in the CTRL_MON group and all tasks in
> - MON groups. Please see example section for more details on usage.
> + This contains a set of directories, one for each instance
> + of an L3 cache, or of a processor package. The L3 cache
> + directories are named "mon_L3_00", "mon_L3_01" etc. The
> + package directories "mon_PERF_PKG_00", "mon_PERF_PKG_01" etc.
> +
> + Within each directory there is one file per event. In
> + the L3 directories: "llc_occupancy", "mbm_total_bytes",
> + and "mbm_local_bytes". In the PERF_PKG directories: "core_energy",
stray tab here
> + "activity", etc.
> +
> + In a MON group these files provide a read out of the current
> + value of the event for all tasks in the group. In CTRL_MON groups
> + these files provide the sum for all tasks in the CTRL_MON group
> + and all tasks in MON groups. Please see example section for more
> + details on usage.
> +
> On systems with Sub-NUMA Cluster (SNC) enabled there are extra
> directories for each node (located within the "mon_L3_XX" directory
> for the L3 cache they occupy). These are named "mon_sub_L3_YY"
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (25 preceding siblings ...)
2025-04-07 23:40 ` [PATCH v3 26/26] x86/resctrl: Update Documentation for package events Tony Luck
@ 2025-04-18 21:13 ` Reinette Chatre
2025-04-21 18:57 ` Luck, Tony
2025-04-19 5:47 ` Reinette Chatre
27 siblings, 1 reply; 67+ messages in thread
From: Reinette Chatre @ 2025-04-18 21:13 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Previous version here:
> https://lore.kernel.org/all/20250321231609.57418-1-tony.luck@intel.com/
>
> This series is based on James Morse's "fs/resctrl/" snapshot.
It would be helpful to provide a link to the snapshot used, to avoid any uncertainty
about what base to use.
>
> Background
>
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.
>
> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package. In the initial
> hardware implementation, there are two categories of events:
>
(missing the two categories of events)
> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
>
> Offset: Counter
> 0x00 core energy for RMID 0
> 0x08 core activity for RMID 0
> 0x10 core energy for RMID 1
> 0x18 core activity for RMID 1
>
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.
>
> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).
>
> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with mush lower overhead (no need to collect data
"mush" -> "much"
> on every context switch).
>
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC
>
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks
>
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles
>
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches
>
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
>
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
>
> uops_retired - Counts the total number of uops retired
>
> Enumeration
>
> The only CPUID based enumeration for this feature is the legacy
> CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
> IA32_PQR_ASSOC MSR and the RMID field within it.
>
> The OOBMSM driver discovers which features are present via
> PCIe VSEC capabilities. Each feature is tagged with a unique
> identifier. These identifiers indicate which XML description file from
> https://github.com/intel/Intel-PMT describes which event counters are
> available and their layout within the MMIO BAR space of the OOBMSM device.
>
> Resctrl User Interface
>
> Because there may be multiple OOBMSM collection agents per processor
> package, resctrl accumulates event counts from all agents on a package
> and presents a single value to users. This will provide a consistent
> user interface on future platforms that vary the number of collectors,
> or the mappings from logical CPUs to collectors.
>
> Users will see the legacy monitoring files in the "L3" directories
> and the telemetry files in "PKG" directories (with each file
Now PERF_PKG?
> providing the aggregated value from all OOBMSM collectors on that
> package).
>
> $ tree /sys/fs/resctrl/mon_data/
> /sys/fs/resctrl/mon_data/
> ├── mon_L3_00
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_L3_01
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_PKG_00
> │ ├── activity
> │ ├── c1_res
> │ ├── c6_res
> │ ├── core_energy
> │ ├── stalls_llc_hit
> │ ├── stalls_llc_miss
> │ ├── unhalted_core_cycles
> │ ├── unhalted_ref_cycles
> │ └── uops_retired
> └── mon_PKG_01
> ├── activity
> ├── c1_res
> ├── c6_res
> ├── core_energy
> ├── stalls_llc_hit
> ├── stalls_llc_miss
> ├── unhalted_core_cycles
> ├── unhalted_ref_cycles
> └── uops_retired
>
> Resctrl Implementation
>
> The OOBMSM driver exposes a function "intel_pmt_get_regions_by_feature()"
(nit: no need to use "a function" if using ())
> that returns an array of structures describing the per-RMID groups it
> found from the VSEC enumeration. Linux looks at the unique identifiers
> for each group and enables resctrl for all groups with known unique
> identifiers.
>
> The memory map for the counters for each <RMID, event> pair is described
> by the XML file. This is too unwieldy to use in the Linux kernel, so a
> simplified representation is built into the resctrl code. Note that the
> counters are in MMIO space instead of being accessed using the IA32_QM_EVTSEL
> and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> calls to read counters from a CPU in a specific domain. The counters
> can be read from any CPU.
>
> High level description of code changes:
>
> 1) New scope RESCTRL_PACKAGE
> 2) New struct rdt_resource RDT_RESOURCE_INTEL_PMT
> 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
> switch (r->rid) {
> case RDT_RESOURCE_L3:
> helper for L3
> break;
> case RDT_RESOURCE_INTEL_PMT:
> helper for PKG
> break;
> }
> 4) New source code file "intel_pmt.c" for the code to enumerate, configure, and report event counts.
Needs an update to match new version of this work.
>
> With only one platform providing this feature, it's tricky to tell
> exactly where it is going to go. I've made the event definitions
> platform specific (based on the unique ID from the VSEC enumeration). It
> seems possible/likely that the list of events may change from generation
> to generation.
>
> I've picked names for events based on the descriptions in the XML file.
One aspect that is only hinted to in the final documentation patch is
how users are expected to use this feature. As I understand the number of
monitor groups supported by resctrl is still guided by the number of RMIDs
supported by L3 monitoring. This work hints that the telemetry feature may
not match that number of RMIDs and a monitor group may thus exist but
when a user attempts to read any of these perf files it will return

"unavailable".
The series attempts to address it by placing the number of RMIDs available
for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
group is not exposed to user space (unless debugging enabled) the user does
not know if a monitor group will support this feature or not. This seems awkward
to me. Why not limit the number of monitor groups that can be created to the
minimum number of RMIDs across these resources like what is done for CLOSid?
Reinette
* Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
2025-04-18 21:13 ` [PATCH v3 00/26] x86/resctrl telemetry monitoring Reinette Chatre
@ 2025-04-21 18:57 ` Luck, Tony
2025-04-21 22:59 ` Reinette Chatre
0 siblings, 1 reply; 67+ messages in thread
From: Luck, Tony @ 2025-04-21 18:57 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
> One aspect that is only hinted to in the final documentation patch is
> how users are expected to use this feature. As I understand the number of
> monitor groups supported by resctrl is still guided by the number of RMIDs
> supported by L3 monitoring. This work hints that the telemetry feature may
> not match that number of RMIDs and a monitor group may thus exist but
> when a user attempts to read any of these perf files it will return
> "unavailable".
>
> The series attempts to address it by placing the number of RMIDs available
> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
> group is not exposed to user space (unless debugging enabled) the user does
> not know if a monitor group will support this feature or not. This seems awkward
> to me. Why not limit the number of monitor groups that can be created to the
> minimum number of RMIDs across these resources like what is done for CLOSid?
Reinette,
The mismatch between number of RMIDs supported by different components
is a thorny one, and may keep repeating since it feels like systems are
composed of a bunch of lego-like bricks snapped together from a box of
parts available to the h/w architect.
In this case we have three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF.
2) The number of registers in MMIO space for each event. This is
enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
3) The number of "h/w counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that
does describe the limitations) feeding to those MMIO registers. This is
enumerated in telemetry_region::num_rmids returned from the call to
intel_pmt_get_regions_by_feature()
If "1" is the smallest of these values, the OS will be limited in
which values can be written to the IA32_PQR_ASSOC MSR. Existing
code will do the right thing by limiting RMID allocation to this
value.
If "2" is greater than "1", then the extra MMIO registers will
sit unused.
If "2" is less than "1" my v3 returns the (problematic) -ENOENT.
This can't happen in the CPU that debuts this feature, but the check
is there to prevent running past the end of the MMIO space in case
this does occur some day. I'll fix the error path in the next version to
make sure this ends up with "Unavailable".
If "3" is less than "2" then the system will attach "h/w counters" to
MMIO registers in a "most recently used" algorithm. So if the number
of active RMIDs in some time interval is less than "3" the user will
get good values. But if the number of active RMIDs rises above "3"
then the user will see "Unavailable" returns as "h/w counters" are
reassigned to different RMIDs (making the feature really hard to use).
In the debut CPU the "energy" feature has sufficient "energy" counters
to avoid this. But not enough "perf" counters. I've pushed and the
next CPU with the feature will have enough "h/w counters".
My proposal for v4:
Add new options to the "rdt=" kernel boot parameter for "energy"
and "perf".
Treat the case where there are not enough "h/w counters" as an erratum
and do not enable the feature. User can override with "rdt=perf"
if they want the counters for some special case where they limit
the number of simultaneous active RMIDs.
User can use "rdt=!energy,!perf" if they don't want to see the
clutter of all the new files in each mon_data directory.
I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
and add a "take min of all RMID limits". But since this is a "can't
happen" scenario I may skip this if it starts to get complicated.
Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
possible that some CPU implementation will have different MMIO
register counts for "perf" and "energy". It's more than possible
that number of "h/w counters" will be different. But they share the
same info file. My v3 code reports the minimum of the number
of "h/w counters" which is the most conservative option. It tells
the user not to make more mon_data directories than this if they
want usable counters across *both* perf and energy. Though they
will actually keep getting good "energy" results even if they
go past this limit.
-Tony
>
* Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
2025-04-21 18:57 ` Luck, Tony
@ 2025-04-21 22:59 ` Reinette Chatre
2025-04-22 16:20 ` Luck, Tony
0 siblings, 1 reply; 67+ messages in thread
From: Reinette Chatre @ 2025-04-21 22:59 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Tony,
On 4/21/25 11:57 AM, Luck, Tony wrote:
> On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
>> One aspect that is only hinted to in the final documentation patch is
>> how users are expected to use this feature. As I understand the number of
>> monitor groups supported by resctrl is still guided by the number of RMIDs
>> supported by L3 monitoring. This work hints that the telemetry feature may
>> not match that number of RMIDs and a monitor group may thus exist but
>> when a user attempts to read any of these perf files it will return
>> "unavailable".
>>
>> The series attempts to address it by placing the number of RMIDs available
>> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
>> group is not exposed to user space (unless debugging enabled) the user does
>> not know if a monitor group will support this feature or not. This seems awkward
>> to me. Why not limit the number of monitor groups that can be created to the
>> minimum number of RMIDs across these resources like what is done for CLOSid?
>
> Reinette,
>
> The mismatch between number of RMIDs supported by different components
> is a thorny one, and may keep repeating since it feels like systems are
> composed of a bunch of lego-like bricks snapped together from a box of
> parts available to the h/w architect.
With resctrl needing to support multiple architectures' way of doing things,
needing to support variety within an architecture just seems like another step.
>
> In this case we have three meanings for "number of RMIDs":
>
> 1) The number for legacy features enumerated by CPUID leaf 0xF.
>
> 2) The number of registers in MMIO space for each event. This is
> enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
>
> 3) The number of "h/w counters" (this isn't a strictly accurate
> description of how things work, but serves as a useful analogy that
> does describe the limitations) feeding to those MMIO registers. This is
> enumerated in telemetry_region::num_rmids returned from the call to
> intel_pmt_get_regions_by_feature()
Thank you for explaining this. This was not clear to me.
>
> If "1" is the smallest of these values, the OS will be limited in
> which values can be written to the IA32_PQR_ASSOC MSR. Existing
> code will do the right thing by limiting RMID allocation to this
> value.
>
> If "2" is greater than "1", then the extra MMIO registers will
> sit unused.
This is also an issue with this implementation, no? resctrl will not
allow creating more monitor groups than "1".
> If "2" is less than "1" my v3 returns the (problematic) -ENOENT.
> This can't happen in the CPU that debuts this feature, but the check
> is there to prevent running past the end of the MMIO space in case
> this does occur some day. I'll fix the error path in the next version to
> make sure this ends up with "Unavailable".
This is a concern since this means the interface becomes a "try and see"
for user space. As I understand a later statement the idea is that
"2" should be used by user space to know how many "mon_groups" directories
should be created to get telemetry support. To me this looks to be
a space that will create a lot of confusion. The moment user space
creates "2" + 1 "mon_groups" directories it becomes a guessing game
of what any new monitor group actually supports. After crossing that
threshold I do not see a good way for going back since if user space
removes one "mon_data" directory it does get back to "2" but then needs to
rely on resctrl internals or debugging to know for sure what the new
monitor group supports.
>
> If "3" is less than "2" then the system will attach "h/w counters" to
> MMIO registers in a "most recently used" algorithm. So if the number
> of active RMIDs in some time interval is less than "3" the user will
> get good values. But if the number of active RMIDs rises above "3"
> then the user will see "Unavailable" returns as "h/w counters" are
> reassigned to different RMIDs (making the feature really hard to use).
Could the next step be for the architecture to allow user space to
specify which hardware counters need to be assigned? With a new user
interface being created for such capability it may be worthwhile to
consider how it could be used/adapted for this feature. [1]
>
> In the debut CPU the "energy" feature has sufficient "energy" counters
> to avoid this. But not enough "perf" counters. I've pushed and the
> next CPU with the feature will have enough "h/w counters".
>
> My proposal for v4:
>
> Add new options to the "rdt=" kernel boot parameter for "energy"
> and "perf".
>
> Treat the case where there are not enough "h/w counters" as an erratum
> and do not enable the feature. User can override with "rdt=perf"
> if they want the counters for some special case where they limit
> the number of simultaneous active RMIDs.
This only seems to address the "3" is less than "2" issue. It is not
so obvious to me that it should be treated as an erratum. Although,
I could not tell from your description how obvious this issue will be
to user space. For example, is it clear that if user space
gets *any* value then it is "good" and "Unavailable" means ... "Unavailable", or
could a returned value mean "this is partial data that was collected
during timeframe with hardware counter re-assigned at some point"?
>
> User can use "rdt=!energy,!perf" if they don't want to see the
> clutter of all the new files in each mon_data directory.
>
> I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
> and add a "take min of all RMID limits". But since this is a "can't
> happen" scenario I may skip this if it starts to get complicated.
I do not think that the "2" is less than "1" scenario should be
ignored for reasons stated above and in review of this version.
What if we enhance resctrl's RMID assignment (setting aside for
a moment PMG assignment) to be directed by user space?
Below is an idea of an interface that can give user space
control over what monitor groups are monitoring. This is very likely not
the ideal interface but I would like to present it as a start for
better ideas.
For example, monitor groups are by default created with most abundant
(and thus supporting fewest features on fewest resources) RMID.
The user is then presented with a new file (within each monitor group)
that lists all available features and which one(s) are active. For example,
let's consider hypothetical example where PERF_PKG perf has x RMID, PERF_PKG energy
has y RMID, and L3_MON has z RMID, with x < y < z. By default when user space
creates a monitor group resctrl will pick "abundant" RMID from range y + 1 to z
that only supports L3 monitoring:
# cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
[L3]
PERF_PKG:energy
PERF_PKG:perf
In above case there will be *no* mon_PERF_PKG_XX directories in
/sys/fs/resctrl/mon_groups/m1/mon_data.
*If* user space wants perf/energy telemetry for this monitor
group then they can enable needed feature with clear understanding that
it is disruptive to all ongoing monitoring since a new RMID will be assigned.
For example, if user wants PERF_PKG:energy and PERF_PKG:perf then
user can do so with:
# echo PERF_PKG:perf > /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
# cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
[L3]
[PERF_PKG:energy]
[PERF_PKG:perf]
After the above all energy and perf files will appear in new mon_PERF_PKG_XX
directories.
User space can then have full control of what is monitored by which monitoring
group. If no RMIDs are available in a particular pool then user space can get
an "out of space" error and be the one to decide how it should be managed.
This also could be a way in which the "2" is larger than "1" scenario
can be addressed.
> Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
> possible that some CPU implementation will have different MMIO
> register counts for "perf" and "energy". It's more than possible
> that number of "h/w counters" will be different. But they share the
> same info file. My v3 code reports the minimum of the number
> of "h/w counters" which is the most conservative option. It tells
> the user not to make more mon_data directories than this if they
> want usable counters across *both* perf and energy. Though they
> will actually keep getting good "energy" results even if they
> go past this limit.
num_rmids is a source of complications. It does not have a good equivalent
for MPAM and there has been a few attempts at proposing alternatives that may
be worth keeping in mind while making changes here:
https://lore.kernel.org/all/cbe665c2-fe83-e446-1696-7115c0f9fd76@arm.com/
https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
>
> -Tony
>
Reinette
[1] https://lore.kernel.org/lkml/cover.1743725907.git.babu.moger@amd.com/
ps. I needed to go back and re-read the original cover-letter a couple of
times, while doing so I noticed one typo in the Background section: OOMMSM -> OOBMSM.
* Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
2025-04-21 22:59 ` Reinette Chatre
@ 2025-04-22 16:20 ` Luck, Tony
2025-04-22 21:30 ` Reinette Chatre
0 siblings, 1 reply; 67+ messages in thread
From: Luck, Tony @ 2025-04-22 16:20 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
On Mon, Apr 21, 2025 at 03:59:15PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 4/21/25 11:57 AM, Luck, Tony wrote:
> > On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
> >> One aspect that is only hinted to in the final documentation patch is
> >> how users are expected to use this feature. As I understand the number of
> >> monitor groups supported by resctrl is still guided by the number of RMIDs
> >> supported by L3 monitoring. This work hints that the telemetry feature may
> >> not match that number of RMIDs and a monitor group may thus exist but
> >> when a user attempts to read any of these perf files it will return
> >> "unavailable".
> >>
> >> The series attempts to address it by placing the number of RMIDs available
> >> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
> >> group is not exposed to user space (unless debugging enabled) the user does
> >> not know if a monitor group will support this feature or not. This seems awkward
> >> to me. Why not limit the number of monitor groups that can be created to the
> >> minimum number of RMIDs across these resources like what is done for CLOSid?
> >
> > Reinette,
> >
> > The mismatch between number of RMIDs supported by different components
> > is a thorny one, and may keep repeating since it feels like systems are
> > composed of a bunch of lego-like bricks snapped together from a box of
> > parts available to the h/w architect.
>
> With resctrl needing to support multiple architectures' way of doing things,
> needing to support variety within an architecture just seems like another step.
>
> >
> > In this case we have three meanings for "number of RMIDs":
> >
> > 1) The number for legacy features enumerated by CPUID leaf 0xF.
> >
> > 2) The number of registers in MMIO space for each event. This is
> > enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
> >
> > 3) The number of "h/w counters" (this isn't a strictly accurate
> > description of how things work, but serves as a useful analogy that
> > does describe the limitations) feeding to those MMIO registers. This is
> > enumerated in telemetry_region::num_rmids returned from the call to
> > intel_pmt_get_regions_by_feature()
>
> Thank you for explaining this. This was not clear to me.
>
> >
> > If "1" is the smallest of these values, the OS will be limited in
> > which values can be written to the IA32_PQR_ASSOC MSR. Existing
> > code will do the right thing by limiting RMID allocation to this
> > value.
> >
> > If "2" is greater than "1", then the extra MMIO registers will
> > sit unused.
>
> This is also an issue with this implementation, no? resctrl will not
> allow creating more monitor groups than "1".
On Intel there is no point in creating more groups than "1" allows.
You can't make use of any RMID above that limit because you will get
a #GP fault trying to write to the IA32_PQR_ASSOC MSR.
You could read the extra MMIO registers provided by "2", but they
will always be zero since no execution occurred with an RMID in the
range "1" ... "2".
The "2" is greater than "1" may be relatively common since the h/w
for the telemetry counters is common for SKUs with different numbers
of cores, and thus different values of "1". So low core count
systems will see more telemetry counters than they can actually
make use of. I will make sure not to print a message for this case.
> > If "2" is less than "1" my v3 returns the (problematic) -ENOENT.
> > This can't happen in the CPU that debuts this feature, but the check
> > is there to prevent running past the end of the MMIO space in case
> > this does occur some day. I'll fix the error path in the next version to
> > make sure this ends up with "Unavailable".
>
> This is a concern since this means the interface becomes a "try and see"
> for user space. As I understand a later statement the idea is that
> "2" should be used by user space to know how many "mon_groups" directories
> should be created to get telemetry support. To me this looks to be
> a space that will create a lot of confusion. The moment user space
> creates "2" + 1 "mon_groups" directories it becomes a guessing game
> of what any new monitor group actually supports. After crossing that
> threshold I do not see a good way for going back since if user space
> removes one "mon_data" directory it does get back to "2" but then needs to
> rely on resctrl internals or debugging to know for sure what the new
> monitor group supports.
But I assert that it is a "can't happen" concern. "2" will be >= "1".
See below. I will look at addressing this, unless it gets crazy complex
because of the different enumeration timeline. Delaying calculation of
number of RMIDs until rdt_get_tree() as you have suggested may be the
right thing to do.
"3" is the real problem
> >
> > If "3" is less than "2" then the system will attach "h/w counters" to
> > MMIO registers in a "most recently used" algorithm. So if the number
> > of active RMIDs in some time interval is less than "3" the user will
> > get good values. But if the number of active RMIDs rises above "3"
> > then the user will see "Unavailable" returns as "h/w counters" are
> > reassigned to different RMIDs (making the feature really hard to use).
>
> Could the next step be for the architecture to allow user space to
> specify which hardware counters need to be assigned? With a new user
> interface being created for such capability it may be worthwhile to
> consider how it could be used/adapted for this feature. [1]
>
> >
> > In the debut CPU the "energy" feature has sufficient "energy" counters
> > to avoid this. But not enough "perf" counters. I've pushed and the
> > next CPU with the feature will have enough "h/w counters".
> >
> > My proposal for v4:
> >
> > Add new options to the "rdt=" kernel boot parameter for "energy"
> > and "perf".
> >
> > Treat the case where there are not enough "h/w counters" as an erratum
> > and do not enable the feature. User can override with "rdt=perf"
> > if they want the counters for some special case where they limit
> > the number of simultaneous active RMIDs.
>
> This only seems to address the "3" is less than "2" issue. It is not
> so obvious to me that it should be treated as an erratum. Although,
> I could not tell from your description how obvious this issue will be
> to user space. For example, is it clear that if user space
> gets *any* value then it is "good" and "Unavailable" means ... "Unavailable", or
> could a returned value mean "this is partial data that was collected
> during timeframe with hardware counter re-assigned at some point"?
When running jobs with more distinct RMIDs than "3" users are at the
mercy of the h/w replacement algorithm. Resctrl use cases for monitoring
are all "read an event counter; wait for some time; re-read the event
counter; compute the rate". With "h/w counter" reassignment the second
read may get "Unavailable", or worse, the "h/w counter" may have been
taken away and then returned, so a value will be provided to the user,
but it won't reflect the count of events since the first read.
That's why I consider this an erratum. There's just false hope that
you can get a pair of meaningful event counts and no sure indication
that you didn't get garbage.
> >
> > User can use "rdt=!energy,!perf" if they don't want to see the
> > clutter of all the new files in each mon_data directory.
> >
> > I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
> > and add a "take min of all RMID limits". But since this is a "can't
> > happen" scenario I may skip this if it starts to get complicated.
>
> I do not think that the "2" is less than "1" scenario should be
> ignored for reasons stated above and in review of this version.
>
> What if we enhance resctrl's RMID assignment (setting aside for
> a moment PMG assignment) to be directed by user space?
I'll take a look at reducing user reported num_rmids to the minimum
of the "1" and "2" values.
> Below is an idea of an interface that can give user space
> control over what monitor groups are monitoring. This is very likely not
> the ideal interface but I would like to present it as a start for
> better ideas.
>
> For example, monitor groups are by default created with most abundant
> (and thus supporting fewest features on fewest resources) RMID.
> The user is then presented with a new file (within each monitor group)
> that lists all available features and which one(s) are active. For example,
> let's consider hypothetical example where PERF_PKG perf has x RMID, PERF_PKG energy
> has y RMID, and L3_MON has z RMID, with x < y < z. By default when user space
> creates a monitor group resctrl will pick "abundant" RMID from range y + 1 to z
> that only supports L3 monitoring:
There is no way for s/w to control the reallocation of "h/w counters"
when "3" is too small. So there is no set of RMIDs that support many
events vs. fewer events. AMD is solving this similar problem with their
scheme to pin h/w counters to specific RMIDs. I discussed such an option
for the "3" case, but it wasn't practical to apply to the upcoming CPU
that has this problem. The long term solution is to ensure that "3" is
always large enough that all RMIDs have equal monitoring capabilities.
> # cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
> [L3]
> PERF_PKG:energy
> PERF_PKG:perf
>
> In above case there will be *no* mon_PERF_PKG_XX directories in
> /sys/fs/resctrl/mon_groups/m1/mon_data.
>
> *If* user space wants perf/energy telemetry for this monitor
> group then they can enable needed feature with clear understanding that
> it is disruptive to all ongoing monitoring since a new RMID will be assigned.
> For example, if user wants PERF_PKG:energy and PERF_PKG:perf then
> user can do so with:
>
> # echo PERF_PKG:perf > /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
> # cat /sys/fs/resctrl/mon_groups/m1/new_file_mon_resource_and_features
> [L3]
> [PERF_PKG:energy]
> [PERF_PKG:perf]
>
> After the above all energy and perf files will appear in new mon_PERF_PKG_XX
> directories.
>
> User space can then have full control of what is monitored by which monitoring
> group. If no RMIDs are available in a particular pool then user space can get
> an "out of space" error and be the one to decide how it should be managed.
>
> This also could be a way in which the "2" is larger than "1" scenario
> can be addressed.
>
> > Which leaves what should be in info/PERF_PKG_MON/num_rmids? It's
> > possible that some CPU implementation will have different MMIO
> > register counts for "perf" and "energy". It's more than possible
> > that number of "h/w counters" will be different. But they share the
> > same info file. My v3 code reports the minimum of the number
> > of "h/w counters" which is the most conservative option. It tells
> > the user not to make more mon_data directories than this if they
> > want usable counters across *both* perf and energy. Though they
> > will actually keep getting good "energy" results even if they
> > go past this limit.
>
> num_rmids is a source of complications. It does not have a good equivalent
> for MPAM and there have been a few attempts at proposing alternatives that may
> be worth keeping in mind while making changes here:
> https://lore.kernel.org/all/cbe665c2-fe83-e446-1696-7115c0f9fd76@arm.com/
> https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
>
> >
> > -Tony
> >
>
> Reinette
>
>
> [1] https://lore.kernel.org/lkml/cover.1743725907.git.babu.moger@amd.com/
>
> ps. I needed to go back and re-read the original cover letter a couple of
> times; while doing so I noticed one typo in the Background section: OOMMSM -> OOBMSM.
Noted. Will fix.
-Tony
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
2025-04-22 16:20 ` Luck, Tony
@ 2025-04-22 21:30 ` Reinette Chatre
0 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-22 21:30 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
linux-kernel, patches
Hi Tony,
On 4/22/25 9:20 AM, Luck, Tony wrote:
> On Mon, Apr 21, 2025 at 03:59:15PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 4/21/25 11:57 AM, Luck, Tony wrote:
>>> On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
>>>> One aspect that is only hinted to in the final documentation patch is
>>>> how users are expected to use this feature. As I understand it, the number of
>>>> monitor groups supported by resctrl is still guided by the number of RMIDs
>>>> supported by L3 monitoring. This work hints that the telemetry feature may
>>>> not match that number of RMIDs and a monitor group may thus exist but
>>>> when a user attempts to read any of these perf files it will return
>>>> "unavailable".
>>>>
>>>> The series attempts to address it by placing the number of RMIDs available
>>>> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
>>>> group is not exposed to user space (unless debugging enabled) the user does
>>>> not know if a monitor group will support this feature or not. This seems awkward
>>>> to me. Why not limit the number of monitor groups that can be created to the
>>>> minimum number of RMIDs across these resources like what is done for CLOSid?
>>>
>>> Reinette,
>>>
>>> The mismatch between number of RMIDs supported by different components
>>> is a thorny one, and may keep repeating since it feels like systems are
>>> composed of a bunch of lego-like bricks snapped together from a box of
>>> parts available to the h/w architect.
>>
>> With resctrl needing to support multiple architectures' ways of doing things,
>> needing to support variety within an architecture just seems like another step.
>>
>>>
>>> In this case we have three meanings for "number of RMIDs":
>>>
>>> 1) The number for legacy features enumerated by CPUID leaf 0xF.
>>>
>>> 2) The number of registers in MMIO space for each event. This is
>>> enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
>>>
>>> 3) The number of "h/w counters" (this isn't a strictly accurate
>>> description of how things work, but serves as a useful analogy that
>>> does describe the limitations) feeding to those MMIO registers. This is
>>> enumerated in telemetry_region::num_rmids returned from the call to
>>> intel_pmt_get_regions_by_feature()
>>
>> Thank you for explaining this. This was not clear to me.
>>
>>>
>>> If "1" is the smallest of these values, the OS will be limited in
>>> which values can be written to the IA32_PQR_ASSOC MSR. Existing
>>> code will do the right thing by limiting RMID allocation to this
>>> value.
>>>
>>> If "2" is greater than "1", then the extra MMIO registers will
>>> sit unused.
>>
>> This is also an issue with this implementation, no? resctrl will not
>> allow creating more monitor groups than "1".
>
> On Intel there is no point in creating more groups than "1" allows.
> You can't make use of any RMID above that limit because you will get
> a #GP fault trying to write to the IA32_PQR_ASSOC MSR.
>
> You could read the extra MMIO registers provided by "2", but they
> will always be zero since no execution occurred with an RMID in the
> range "1" ... "2".
>
> The "2" greater than "1" case may be relatively common since the h/w
> for the telemetry counters is common for SKUs with different numbers
> of cores, and thus different values of "1". So low core count
> systems will see more telemetry counters than they can actually
> make use of. I will make sure not to print a message for this case.
I see, thank you.
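For what it is worth, the clamping described above could be sketched like
this (names are made up for illustration, not actual resctrl identifiers):

```c
#include <assert.h>

/*
 * Illustrative sketch only: the usable RMID count is the minimum of the
 * CPUID-enumerated limit ("1") and the per-event MMIO register count ("2").
 * RMIDs above the CPUID limit cannot be written to IA32_PQR_ASSOC, so any
 * extra MMIO registers beyond that limit sit unused.
 */
static unsigned int effective_num_rmids(unsigned int cpuid_rmids,
					unsigned int mmio_rmids)
{
	return cpuid_rmids < mmio_rmids ? cpuid_rmids : mmio_rmids;
}
```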
>
>>> If "2" is less than "1" my v3 returns the (problematic) -ENOENT.
>>> This can't happen in the CPU that debuts this feature, but the check
>>> is there to prevent running past the end of the MMIO space in case
>>> this does occur some day. I'll fix the error path in the next version to
>>> make sure this ends up with "Unavailable".
>>
>> This is a concern since this means the interface becomes a "try and see"
>> for user space. As I understand a later statement, the idea is that
>> "2" should be used by user space to know how many "mon_groups" directories
>> should be created to get telemetry support. To me this looks to be
>> a space that will create a lot of confusion. The moment user space
>> creates "2" + 1 "mon_groups" directories it becomes a guessing game
>> of what any new monitor group actually supports. After crossing that
>> threshold I do not see a good way for going back since if user space
>> removes one "mon_groups" directory it does get back to "2" but then needs to
>> rely on resctrl internals or debugging to know for sure what the new
>> monitor group supports.
>
> But I assert that it is a "can't happen" concern. "2" will be >= "1".
> See below. I will look at addressing this, unless it gets crazy complex
> because of the different enumeration timeline. Delaying calculation of
> number of RMIDs until rdt_get_tree() as you have suggested may be the
> right thing to do.
As you explain it, it does not sound as though calculating how many
RMIDs can be supported must be done in rdt_get_tree(), but doing
so would make the implementation more robust since it would not rely
on assumptions about what hardware can and will support.
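Moving the computation to mount time could look roughly like this
(hypothetical helper, not the actual resctrl code):

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical sketch: at mount time (rdt_get_tree()), take the minimum
 * RMID count across all enumerated monitoring resources rather than
 * relying on assumptions about how the hardware orders the limits.
 */
static unsigned int mount_time_num_rmids(const unsigned int *per_resource,
					 unsigned int nr_resources)
{
	unsigned int min = UINT_MAX;

	for (unsigned int i = 0; i < nr_resources; i++)
		if (per_resource[i] < min)
			min = per_resource[i];
	return nr_resources ? min : 0;
}
```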
>
> "3" is the real problem
>
>>>
>>> If "3" is less than "2" then the system will attach "h/w counters" to
>>> MMIO registers in a "most recently used" algorithm. So if the number
>>> of active RMIDs in some time interval is less than "3" the user will
>>> get good values. But if the number of active RMIDs rises above "3"
>>> then the user will see "Unavailable" returns as "h/w counters" are
>>> reassigned to different RMIDs (making the feature really hard to use).
>>
>> Could the next step be for the architecture to allow user space to
>> specify which hardware counters need to be assigned? With a new user
>> interface being created for such capability it may be worthwhile to
>> consider how it could be used/adapted for this feature. [1]
>>
>>>
>>> In the debut CPU the "energy" feature has sufficient "energy" counters
>>> to avoid this. But not enough "perf" counters. I've pushed and the
>>> next CPU with the feature will have enough "h/w counters".
>>>
>>> My proposal for v4:
>>>
>>> Add new options to the "rdt=" kernel boot parameter for "energy"
>>> and "perf".
>>>
>>> Treat the case where there are not enough "h/w counters" as an erratum
>>> and do not enable the feature. User can override with "rdt=perf"
>>> if they want the counters for some special case where they limit
>>> the number of simultaneous active RMIDs.
I get this now. This will require rework of the kernel command line parsing
support since the current implementation is so closely integrated with the
X86_FEATURE_* flags (and is perhaps an unexpected architecture specific
portion of resctrl).
What if "rdt=perf" means that "3" is also included in the computation
of how many monitor groups are supported? That would help users to not
need to limit the number of simultaneous active RMIDs.
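Such a computation could look roughly like this sketch (all names
hypothetical; not actual resctrl code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: limit the number of monitor groups to the minimum
 * of the CPUID RMID limit ("1") and the MMIO register count ("2"); when
 * the user opts in with "rdt=perf", also fold in the h/w counter count
 * ("3") so active RMIDs never exceed what the counters can track.
 */
static unsigned int mon_group_limit(unsigned int cpuid_rmids,
				    unsigned int mmio_rmids,
				    unsigned int hw_counters,
				    bool include_hw_counters)
{
	unsigned int limit = cpuid_rmids < mmio_rmids ? cpuid_rmids : mmio_rmids;

	if (include_hw_counters && hw_counters < limit)
		limit = hw_counters;
	return limit;
}
```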
>>
>> This only seems to address the "3" is less than "2" issue. It is not
>> so obvious to me that it should be treated as an erratum. Although,
>> I could not tell from your description how obvious this issue will be
>> to user space. For example, is it clear that if user space
>> gets *any* value then it is "good" and "Unavailable" means ... "Unavailable", or
>> could a returned value mean "this is partial data that was collected
>> during timeframe with hardware counter re-assigned at some point"?
>
> When running jobs with more distinct RMIDs than "3" users are at the
> mercy of the h/w replacement algorithm. Resctrl use cases for monitoring
> are all "read an event counter; wait for some time; re-read the event
> counter; compute the rate". With "h/w counter" reassignment the second
> read may get "Unavailable", or worse, the "h/w counter" may have been
> taken and reassigned, so a value will be provided to the user, but
> it won't be the count of events since the first read.
>
> That's why I consider this an erratum. There's just false hope that
> you can get a pair of meaningful event counts and no sure indication
> that you didn't get garbage.
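The read/wait/re-read pattern and its failure mode could be sketched like
this (purely illustrative; the EVENT_UNAVAILABLE sentinel is made up and
stands in for resctrl's "Unavailable" return):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVENT_UNAVAILABLE UINT64_MAX	/* stand-in sentinel, not the real ABI */

/*
 * Compute an event rate from two counter reads taken 'interval' units
 * apart. Returns false if either read was unavailable. Note the failure
 * mode described above: if the h/w counter backing the RMID was reassigned
 * between the two reads, both reads can still "succeed" and the delta is
 * garbage, which this check cannot detect.
 */
static bool event_rate(uint64_t first, uint64_t second, uint64_t interval,
		       uint64_t *rate)
{
	if (first == EVENT_UNAVAILABLE || second == EVENT_UNAVAILABLE ||
	    second < first || interval == 0)
		return false;
	*rate = (second - first) / interval;
	return true;
}
```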
>
>>>
>>> User can use "rdt=!energy,!perf" if they don't want to see the
>>> clutter of all the new files in each mon_data directory.
>>>
>>> I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
>>> and add a "take min of all RMID limits". But since this is a "can't
>>> happen" scenario I may skip this if it starts to get complicated.
>>
>> I do not think that the "2" is less than "1" scenario should be
>> ignored for reasons stated above and in review of this version.
>>
>> What if we enhance resctrl's RMID assignment (setting aside for
>> a moment PMG assignment) to be directed by user space?
>
> I'll take a look at reducing user reported num_rmids to the minimum
> of the "1" and "2" values.
When comparing to "num_closids" the expectation may be that "num_rmids"
would be accurate for a particular resource, with the understanding that the
minimum among all resources guides the number of monitor groups. This
seems close enough to the existing interface that this is not the moment
to move to a new "num_mon_hw_id" or such that works for MPAM also.
>
>> Below is an idea of an interface that can give user space
>> control over what monitor groups are monitoring. This is very likely not
>> the ideal interface but I would like to present it as a start for
>> better ideas.
>>
>> For example, monitor groups are by default created with the most abundant
>> (and thus supporting the fewest features on the fewest resources) RMID.
>> The user is then presented with a new file (within each monitor group)
>> that lists all available features and which one(s) are active. For example,
>> let's consider a hypothetical example where PERF_PKG perf has x RMIDs, PERF_PKG energy
>> has y RMIDs, and L3_MON has z RMIDs, with x < y < z. By default when user space
>> creates a monitor group resctrl will pick an "abundant" RMID from range y + 1 to z
>> that only supports L3 monitoring:
>
> There is no way for s/w to control the reallocation of "h/w counters"
> when "3" is too small. So there is no set of RMIDs that support many
> events vs. fewer events. AMD is solving this similar problem with their
> scheme to pin h/w counters to specific RMIDs. I discussed such an option
> for the "3" case, but it wasn't practical to apply to the upcoming CPU
> that has this problem. The long term solution is to ensure that "3" is
> always large enough that all RMIDs have equal monitoring capabilities.
ack.
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
` (26 preceding siblings ...)
2025-04-18 21:13 ` [PATCH v3 00/26] x86/resctrl telemetry monitoring Reinette Chatre
@ 2025-04-19 5:47 ` Reinette Chatre
27 siblings, 0 replies; 67+ messages in thread
From: Reinette Chatre @ 2025-04-19 5:47 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy
Cc: linux-kernel, patches
Hi Tony,
Just noticed ... could you please include x86 maintainers
(X86 ARCHITECTURE (32-BIT AND 64-BIT) in MAINTAINERS) in your
submission?
Thank you
Reinette
^ permalink raw reply [flat|nested] 67+ messages in thread