* [PATCH v4 00/31] x86/resctrl telemetry monitoring
@ 2025-04-29 0:33 Tony Luck
2025-04-29 0:33 ` [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable Tony Luck
` (30 more replies)
0 siblings, 31 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
These patches are based on James Morse's latest patch set to:
"Move the resctrl filesystem code to /fs/resctrl"
posted here:
Link: https://lore.kernel.org/all/20250425173809.5529-1-james.morse@arm.com/
Also available in the "mpam/move_to_fs/v9_final" branch of:
Link: git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git
I've pushed a combination of James' series plus these patches to the
rdt-aet-v4 branch at:
Link: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
Extensive changes (based on feedback from Reinette) have been made since v3, which was posted here:
Link: https://lore.kernel.org/all/20250407234032.241215-1-tony.luck@intel.com/
Major changes:
1) Instead of using bits in the architecture local "rdt_mon_features"
variable to keep track of enabled monitor events, use mon_evt::enabled
to track at the file system layer. Architecture informs file system
which events are enabled. This means that file system no longer needs
any of the resctrl_arch_is_*_enabled() calls to architecture as it now
has the array of mon_evt structures to check. This is one step in making
the mon_evt structure the source of all information about each event.
2) Split the v3 "Prepare for more monitor events" patch into three
easier to digest pieces.
3) Simplified the "Improve domain type checking" patch by making
the rdt_resource type its own field in the rdt_domain_hdr structure
instead of encoding it in a bit field combined with the CTRL/MON type.
4) Added "l3" to a bunch of function and structure names to indicate
that they are now specific to L3 events instead of generic monitoring.
5) Struct mon_evt is also the source of truth for "can this event be
read from any CPU?". Other structures (mon_data and rmid_read) now
have pointers to mon_evt instead of their own field copied from
mon_evt.
6) Events that can be read on any CPU now bypass the
cpumask_any_housekeeping() path that would have resulted in an
IPI to the first CPU on a domain. mon_event_read() now directly
calls mon_event_count() for these events.
7) Renamed the per-mount hook and commented on (lack of) locking
by the caller.
8) Split the enumeration of telemetry events into easier to
review chunks with more comments in the code at each stage.
9) Simplified the intel_aet_read_event() code. No funky macros
to pick up parameters for the MMIO address calculation. Added
a sanity check that the computed MMIO register address is in
the range provided by the aggregator.
10) File system now owns the output format. Architecture cannot
make choices. Every event is hard-coded to be displayed as
integer or floating point.
11) Added additional options to the rdt= boot option for the user
to force opt-in or opt-out of telemetry events. Use these options
to solve the "how many RMIDs can be used?" issue.
12) Moved final calculation of available number of RMIDs to first
mount of resctrl file system and made it determine the smallest value
across all mon_capable resources.
13) Version 2 of the patch series included extra files in the info/
directory to report some internal status values. V3 dropped that
entirely because I couldn't see a good way to cross the fs<->arch
boundary with extra architecture specific info files. Patches
29-30 are an RFC way to bring this back when the file system is
mounted with the "debug" option.
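Item 12 above amounts to taking a minimum over the monitoring-capable resources. A minimal sketch, assuming a hypothetical, simplified `struct res` with `mon_capable` and `num_rmid` fields; the helper name and the "return 0 when nothing is monitoring capable" convention are illustrative and not taken from the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a resctrl resource. */
struct res {
	bool mon_capable;	/* resource supports monitoring */
	uint32_t num_rmid;	/* RMIDs this resource can track */
};

/*
 * Return the number of RMIDs usable by every monitoring-capable
 * resource, i.e. the smallest num_rmid among them. Returns 0 when
 * no resource is monitoring capable.
 */
static uint32_t usable_rmids(const struct res *r, size_t n)
{
	uint32_t min = 0;
	bool found = false;

	assert(r || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!r[i].mon_capable)
			continue;
		if (!found || r[i].num_rmid < min)
			min = r[i].num_rmid;
		found = true;
	}
	return min;
}
```

Resources that are not monitoring capable are skipped, so a control-only resource with a small RMID count does not lower the result.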
Background
----------
Telemetry features are being implemented in conjunction with the
IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
counts for various events to a collector in a nearby OOBMSM device to be
accumulated with counts for each <RMID, event> pair received from other
CPUs. Cores send event counts when the RMID value changes, or after
every 2ms of elapsed time.
Each OOBMSM device may implement multiple event collectors with each
servicing a subset of the logical CPUs on a package. In the initial
hardware implementation, there are two categories of events: energy
and perf.
1) Energy - Two counters
core_energy: This is an estimate of Joules consumed by each core. It is
calculated based on the types of instructions executed, not from a power
meter. This counter is useful to understand how much energy a workload
is consuming.
activity: This measures "accumulated dynamic capacitance". Users who
want to optimize energy consumption for a workload may use this rather
than core_energy because it provides consistent results independent of
any frequency or voltage changes that may occur during the runtime of
the application (e.g. entry/exit from turbo mode).
2) Performance - Seven counters
These are similar events to those available via the Linux "perf" tool,
but collected with much lower overhead (no need to collect data
on every context switch).
stalls_llc_hit - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which hit in the LLC
c1_res - Counts the total C1 residency across all cores. The underlying
counter increments on 100MHz clock ticks
unhalted_core_cycles - Counts the total number of unhalted core clock
cycles
stalls_llc_miss - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which missed all the
local caches
c6_res - Counts the total C6 residency. The underlying counter increments
on crystal clock (25MHz) ticks
unhalted_ref_cycles - Counts the total number of unhalted reference clock
(TSC) cycles
uops_retired - Counts the total number of uops retired
The counters are arranged in groups in MMIO space of the OOBMSM device.
E.g. for the energy counters the layout is:
Offset: Counter
0x00 core energy for RMID 0
0x08 core activity for RMID 0
0x10 core energy for RMID 1
0x18 core activity for RMID 1
...
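The layout above implies a simple address calculation. A minimal sketch, assuming every counter is 8 bytes wide and that each RMID's counters are stored contiguously, as the energy example shows; the function names and the `num_events` parameter are hypothetical, and the range check only mirrors the idea of the sanity check mentioned in item 9 of the change list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COUNTER_BYTES 8u	/* each counter is one 64-bit register */

/*
 * Byte offset of one counter inside a group laid out as shown above:
 * all events for RMID 0, then all events for RMID 1, and so on.
 * evt_idx is the position of the event within one RMID's block
 * (for the energy group: 0 = core energy, 1 = activity).
 */
static size_t counter_offset(uint32_t rmid, uint32_t evt_idx, uint32_t num_events)
{
	assert(evt_idx < num_events);
	return ((size_t)rmid * num_events + evt_idx) * COUNTER_BYTES;
}

/* True if the whole 8-byte counter lies inside a region of region_size bytes. */
static bool counter_in_range(size_t offset, size_t region_size)
{
	return region_size >= COUNTER_BYTES &&
	       offset <= region_size - COUNTER_BYTES;
}
```

With `num_events` set to 2, RMID 1 and event index 0 give offset 0x10, matching the "core energy for RMID 1" row.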
Enumeration
-----------
The only CPUID based enumeration for this feature is the legacy
CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
IA32_PQR_ASSOC MSR and the RMID field within it.
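For illustration, the CPUID bit quoted above can be tested from user space. This is a sketch only: the helper names are hypothetical, the kernel uses its own feature-flag infrastructure rather than code like this, and the `<cpuid.h>` part assumes a GCC or Clang toolchain on x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Test one bit of a 32-bit CPUID output register. */
static inline bool cpuid_reg_bit(uint32_t reg, unsigned int bit)
{
	assert(bit < 32);
	return (reg >> bit) & 1u;
}

/* Bit 12 of CPUID(eax=7,ecx=0).ebx, as quoted in the text above. */
static inline bool ebx_has_rmid_support(uint32_t ebx)
{
	return cpuid_reg_bit(ebx, 12);
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper header */

/* Query the running CPU; false if CPUID leaf 7 is unavailable. */
static inline bool cpu_has_rmid_support(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return ebx_has_rmid_support(ebx);
}
#endif
```

A set bit only says that the RMID field exists. Which telemetry events are present still has to come from the VSEC enumeration described next.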
The OOBMSM driver discovers which features are present via
PCIe VSEC capabilities. Each feature is tagged with a unique
identifier. These identifiers indicate which XML description file from
https://github.com/intel/Intel-PMT describes which event counters are
available and their layout within the MMIO BAR space of the OOBMSM device.
Resctrl User Interface
----------------------
Because there may be multiple OOBMSM collection agents per processor
package, resctrl accumulates event counts from all agents on a package
and presents a single value to users. This will provide a consistent
user interface on future platforms that vary the number of collectors,
or the mappings from logical CPUs to collectors.
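The per-package accumulation can be pictured as a plain sum over the collectors. A minimal sketch, assuming each collector yields one 64-bit value for the requested <RMID, event> pair; the function name and the array-of-values interface are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the counter values reported by every collector on one package
 * for a single <RMID, event> pair. vals[i] is the value read from
 * collector i; the result is the single value presented to the user.
 */
static uint64_t package_event_total(const uint64_t *vals, size_t n_collectors)
{
	uint64_t sum = 0;

	assert(vals || n_collectors == 0);
	for (size_t i = 0; i < n_collectors; i++)
		sum += vals[i];
	return sum;
}
```

Because the caller only sees the total, the number of collectors and the CPU-to-collector mapping can change between platforms without affecting the user interface.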
Users will continue to see the legacy monitoring files in the "L3"
directories and the telemetry files in the new "PERF_PKG" directories
(with each file providing the aggregated value from all OOBMSM collectors
on that package).
$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
├── mon_L3_00
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_L3_01
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_PERF_PKG_00
│ ├── activity
│ ├── c1_res
│ ├── c6_res
│ ├── core_energy
│ ├── stalls_llc_hit
│ ├── stalls_llc_miss
│ ├── unhalted_core_cycles
│ ├── unhalted_ref_cycles
│ └── uops_retired
└── mon_PERF_PKG_01
├── activity
├── c1_res
├── c6_res
├── core_energy
├── stalls_llc_hit
├── stalls_llc_miss
├── unhalted_core_cycles
├── unhalted_ref_cycles
└── uops_retired
Resctrl Implementation
----------------------
The OOBMSM driver exposes a function, intel_pmt_get_regions_by_feature(),
that returns an array of structures describing the per-RMID groups it
found from the VSEC enumeration. Linux looks at the unique identifiers
for each group and enables resctrl for all groups with known unique
identifiers.
The memory map for the counters for each <RMID, event> pair is described
by the XML file. This is too unwieldy to use in the Linux kernel, so a
simplified representation is built into the resctrl code. Note that the
counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
and IA32_QM_CTR MSRs. This means there is no need for cross-processor
calls to read counters from a CPU in a specific domain. The counters
can be read from any CPU.
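The consequence for the read path can be sketched as a choice driven by a per-event property. This is a user-space model under stated assumptions: `struct evt_desc`, its `any_cpu` field, and `pick_read_path()` are hypothetical names, not the series' structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of an event descriptor. */
struct evt_desc {
	const char *name;
	bool any_cpu;	/* true: counter can be read from any CPU */
};

enum read_path {
	READ_DIRECT,		/* read on the calling CPU */
	READ_ON_DOMAIN_CPU,	/* must run on a CPU in the event's domain */
};

/* Decide how a counter read should be carried out for this event. */
static enum read_path pick_read_path(const struct evt_desc *evt)
{
	assert(evt);
	return evt->any_cpu ? READ_DIRECT : READ_ON_DOMAIN_CPU;
}
```

MMIO-backed telemetry events would take the direct path, while MSR-backed L3 events still need to run on a CPU in the right domain.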
High level description of code changes:
1) New scope RESCTRL_PACKAGE
2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
switch (r->rid) {
case RDT_RESOURCE_L3:
helper for L3
break;
case RDT_RESOURCE_PERF_PKG:
helper for PKG
break;
}
4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.
With only one platform providing this feature, it's tricky to tell
exactly how it will evolve. I've made the event definitions
platform specific (based on the unique ID from the VSEC enumeration). It
seems possible/likely that the list of events may change from generation
to generation.
I've picked names for events based on the descriptions in the XML file.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tony Luck (31):
x86,fs/resctrl: Drop rdt_mon_features variable
x86,fs/resctrl: Prepare for more monitor events
fs/resctrl: Clean up rdtgroup_mba_mbps_event_{show,write}()
fs/resctrl: Change how and when events are initialized
fs/resctrl: Set up Kconfig options for telemetry events
x86/resctrl: Fake OOBMSM interface
x86,fs/resctrl: Improve domain type checking
x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain
types
x86/resctrl: Change generic monitor functions to use struct
rdt_domain_hdr
x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
fs/resctrl: Improve handling for events that can be read from any CPU
fs/resctrl: Add support for additional monitor event display formats
fs/resctrl: Add an architectural hook called for each mount
x86/resctrl: Add and initialize rdt_resource for package scope core
monitor
x86/resctrl: Add first part of telemetry event enumeration
x86/resctrl: Add second part of telemetry event enumeration
x86/resctrl: Add third part of telemetry event enumeration
x86,fs/resctrl: Fill in details of Clearwater Forest events
x86/resctrl: Check for adequate MMIO space
x86/resctrl: Add fourth part of telemetry event enumeration
x86/resctrl: Read core telemetry events
x86,fs/resctrl: Handle domain creation/deletion for
RDT_RESOURCE_PERF_PKG
fs/resctrl: Add type define for PERF_PKG files
x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG
x86/resctrl: Add energy/perf choices to rdt boot option
x86/resctrl: Handle number of RMIDs supported by telemetry resources
x86,fs/resctrl: Fix RMID allocation for multiple monitor resources
fs/resctrl: Add interface for per-resource debug info files
x86/resctrl: Add info/PERF_PKG_MON/status file
x86/resctrl: Update Documentation for package events
.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/filesystems/resctrl.rst | 53 ++-
include/linux/resctrl.h | 51 ++-
include/linux/resctrl_types.h | 19 +
arch/x86/include/asm/resctrl.h | 16 -
.../cpu/resctrl/fake_intel_aet_features.h | 73 ++++
arch/x86/kernel/cpu/resctrl/internal.h | 35 +-
fs/resctrl/internal.h | 42 ++-
arch/x86/kernel/cpu/resctrl/core.c | 273 ++++++++++----
.../cpu/resctrl/fake_intel_aet_features.c | 95 +++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 343 ++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 61 ++--
fs/resctrl/ctrlmondata.c | 93 ++---
fs/resctrl/monitor.c | 269 +++++++++-----
fs/resctrl/rdtgroup.c | 221 +++++++----
arch/x86/Kconfig | 1 +
arch/x86/kernel/cpu/resctrl/Makefile | 2 +
drivers/platform/x86/intel/pmt/Kconfig | 7 +
18 files changed, 1283 insertions(+), 373 deletions(-)
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
base-repository: git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git
base-branch: mpam/move_to_fs/v9_final
base-commit: dc979ecda2982f7c09de81cde1ec902fdc8e202f
--
2.48.1
* [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:28 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 02/31] x86,fs/resctrl: Prepare for more monitor events Tony Luck
` (29 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The fs/arch boundary is a little muddy for adding new monitor features.
Clean it up by making the mon_evt structure the source of all information
about each event. In this case replace the bitmap of enabled monitor
features with an "enabled" bit in the mon_evt structure.
Change architecture code to inform file system code which events are
available on a system with resctrl_enable_mon_event().
Replace the event and architecture specific:
resctrl_arch_is_llc_occupancy_enabled()
resctrl_arch_is_mbm_total_enabled()
resctrl_arch_is_mbm_local_enabled()
functions with calls to resctrl_is_mon_event_enabled() with the
appropriate QOS_L3_* enum resctrl_event_id.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +
arch/x86/include/asm/resctrl.h | 16 -------
fs/resctrl/internal.h | 4 ++
arch/x86/kernel/cpu/resctrl/core.c | 25 +++++++----
arch/x86/kernel/cpu/resctrl/monitor.c | 9 +---
fs/resctrl/ctrlmondata.c | 4 +-
fs/resctrl/monitor.c | 60 ++++++++++++++++-----------
fs/resctrl/rdtgroup.c | 18 ++++----
8 files changed, 71 insertions(+), 67 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 9ba771f2ddea..3c5d111aae65 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -372,6 +372,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
+void resctrl_enable_mon_event(enum resctrl_event_id evtid);
+bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
/**
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 189f885dcf3e..a59b3adb56cd 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -42,7 +42,6 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
extern bool rdt_alloc_capable;
extern bool rdt_mon_capable;
-extern unsigned int rdt_mon_features;
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
@@ -82,21 +81,6 @@ static inline void resctrl_arch_disable_mon(void)
static_branch_dec_cpuslocked(&rdt_enable_key);
}
-static inline bool resctrl_arch_is_llc_occupancy_enabled(void)
-{
- return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID));
-}
-
-static inline bool resctrl_arch_is_mbm_total_enabled(void)
-{
- return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID));
-}
-
-static inline bool resctrl_arch_is_mbm_local_enabled(void)
-{
- return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID));
-}
-
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 6dd2a74cf3ec..ff89a0ca130e 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -70,15 +70,19 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @evtid: event id
* @name: name of the event
* @configurable: true if the event is configurable
+ * @enabled: true if the event is enabled
* @list: entry in &rdt_resource->evt_list
*/
struct mon_evt {
enum resctrl_event_id evtid;
char *name;
bool configurable;
+ bool enabled;
struct list_head list;
};
+extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
+
/**
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 224bed28f341..819bc7a09327 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -401,13 +401,13 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
size_t tsize;
- if (resctrl_arch_is_mbm_total_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
tsize = sizeof(*hw_dom->arch_mbm_total);
hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
if (!hw_dom->arch_mbm_total)
return -ENOMEM;
}
- if (resctrl_arch_is_mbm_local_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
tsize = sizeof(*hw_dom->arch_mbm_local);
hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
if (!hw_dom->arch_mbm_local) {
@@ -860,15 +860,22 @@ static __init bool get_rdt_alloc_resources(void)
static __init bool get_rdt_mon_resources(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ bool ret = false;
- if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
- rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
- if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL))
- rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
- if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
- rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
+ if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
+ ret = true;
+ }
+ if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
+ ret = true;
+ }
+ if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
+ ret = true;
+ }
- if (!rdt_mon_features)
+ if (!ret)
return false;
return !rdt_get_mon_l3_config(r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3fc4d9f56f0d..fda579251dba 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -30,11 +30,6 @@
*/
bool rdt_mon_capable;
-/*
- * Global to indicate which monitoring events are enabled.
- */
-unsigned int rdt_mon_features;
-
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
static int snc_nodes_per_l3_cache = 1;
@@ -206,11 +201,11 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
- if (resctrl_arch_is_mbm_total_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
memset(hw_dom->arch_mbm_total, 0,
sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
memset(hw_dom->arch_mbm_local, 0,
sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
}
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index d56b78450a99..b17b60114afd 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -473,12 +473,12 @@ ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
rdt_last_cmd_clear();
if (!strcmp(buf, "mbm_local_bytes")) {
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
else
ret = -EINVAL;
} else if (!strcmp(buf, "mbm_total_bytes")) {
- if (resctrl_arch_is_mbm_total_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
else
ret = -EINVAL;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index bde2801289d3..7de4e219dba3 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -336,7 +336,7 @@ void free_rmid(u32 closid, u32 rmid)
entry = __rmid_entry(idx);
- if (resctrl_arch_is_llc_occupancy_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
add_rmid_to_limbo(entry);
else
list_add_tail(&entry->list, &rmid_free_lru);
@@ -635,10 +635,10 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
* This is protected from concurrent reads from user as both
* the user and overflow handler hold the global mutex.
*/
- if (resctrl_arch_is_mbm_total_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_TOTAL_EVENT_ID);
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_LOCAL_EVENT_ID);
}
@@ -842,20 +842,33 @@ static void dom_data_exit(struct rdt_resource *r)
mutex_unlock(&rdtgroup_mutex);
}
-static struct mon_evt llc_occupancy_event = {
- .name = "llc_occupancy",
- .evtid = QOS_L3_OCCUP_EVENT_ID,
+struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
+ [QOS_L3_OCCUP_EVENT_ID] = {
+ .name = "llc_occupancy",
+ .evtid = QOS_L3_OCCUP_EVENT_ID,
+ },
+ [QOS_L3_MBM_TOTAL_EVENT_ID] = {
+ .name = "mbm_total_bytes",
+ .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
+ },
+ [QOS_L3_MBM_LOCAL_EVENT_ID] = {
+ .name = "mbm_local_bytes",
+ .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
+ },
};
-static struct mon_evt mbm_total_event = {
- .name = "mbm_total_bytes",
- .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
-};
+void resctrl_enable_mon_event(enum resctrl_event_id evtid)
+{
+ if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
+ return;
-static struct mon_evt mbm_local_event = {
- .name = "mbm_local_bytes",
- .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
-};
+ mon_event_all[evtid].enabled = true;
+}
+
+bool resctrl_is_mon_event_enabled(enum resctrl_event_id evtid)
+{
+ return evtid < QOS_NUM_EVENTS && mon_event_all[evtid].enabled;
+}
/*
* Initialize the event list for the resource.
@@ -866,14 +879,13 @@ static struct mon_evt mbm_local_event = {
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
+ enum resctrl_event_id evt;
+
INIT_LIST_HEAD(&r->evt_list);
- if (resctrl_arch_is_llc_occupancy_enabled())
- list_add_tail(&llc_occupancy_event.list, &r->evt_list);
- if (resctrl_arch_is_mbm_total_enabled())
- list_add_tail(&mbm_total_event.list, &r->evt_list);
- if (resctrl_arch_is_mbm_local_enabled())
- list_add_tail(&mbm_local_event.list, &r->evt_list);
+ for (evt = 0; evt < QOS_NUM_EVENTS; evt++)
+ if (mon_event_all[evt].enabled)
+ list_add_tail(&mon_event_all[evt].list, &r->evt_list);
}
/**
@@ -903,19 +915,19 @@ int resctrl_mon_resource_init(void)
l3_mon_evt_init(r);
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
- mbm_total_event.configurable = true;
+ mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) {
- mbm_local_event.configurable = true;
+ mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_local_bytes_config",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
mba_mbps_default_event = QOS_L3_MBM_LOCAL_EVENT_ID;
- else if (resctrl_arch_is_mbm_total_enabled())
+ else if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
return 0;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 07f91d18c1b8..4a092c305255 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -123,8 +123,8 @@ void rdt_staged_configs_clear(void)
static bool resctrl_is_mbm_enabled(void)
{
- return (resctrl_arch_is_mbm_total_enabled() ||
- resctrl_arch_is_mbm_local_enabled());
+ return (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID) ||
+ resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID));
}
static bool resctrl_is_mbm_event(int e)
@@ -196,7 +196,7 @@ static int closid_alloc(void)
lockdep_assert_held(&rdtgroup_mutex);
if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID) &&
- resctrl_arch_is_llc_occupancy_enabled()) {
+ resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
cleanest_closid = resctrl_find_cleanest_closid();
if (cleanest_closid < 0)
return cleanest_closid;
@@ -4046,7 +4046,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
- if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
/*
* When a package is going down, forcefully
* decrement rmid->ebusy. There is no way to know
@@ -4082,12 +4082,12 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize;
- if (resctrl_arch_is_llc_occupancy_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
- if (resctrl_arch_is_mbm_total_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
tsize = sizeof(*d->mbm_total);
d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_total) {
@@ -4095,7 +4095,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
return -ENOMEM;
}
}
- if (resctrl_arch_is_mbm_local_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
tsize = sizeof(*d->mbm_local);
d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_local) {
@@ -4140,7 +4140,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
RESCTRL_PICK_ANY_CPU);
}
- if (resctrl_arch_is_llc_occupancy_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
/*
@@ -4215,7 +4215,7 @@ void resctrl_offline_cpu(unsigned int cpu)
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0, cpu);
}
- if (resctrl_arch_is_llc_occupancy_enabled() &&
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
cancel_delayed_work(&d->cqm_limbo);
cqm_setup_limbo_handler(d, 0, cpu);
--
2.48.1
* [PATCH v4 02/31] x86,fs/resctrl: Prepare for more monitor events
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
2025-04-29 0:33 ` [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:30 ` Reinette Chatre
2025-05-09 15:02 ` Peter Newman
2025-04-29 0:33 ` [PATCH v4 03/31] fs/resctrl: Clean up rdtgroup_mba_mbps_event_{show,write}() Tony Luck
` (28 subsequent siblings)
30 siblings, 2 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There's a rule in computer programming that objects appear zero,
once, or many times. So code accordingly.
There are two MBM events and resctrl is coded with a lot of
if (local)
do one thing
if (total)
do a different thing
Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold
arrays of pointers to per event data instead of explicit fields for
total and local bandwidth.
Simplify the code by coding for many events, using loops over
those that are enabled.
Move resctrl_is_mbm_event() to <linux/resctrl.h> so it
can be used more widely. Also provide a for_each_mbm_event()
helper macro.
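The index mapping and the "loop over enabled events" pattern can be tried in user space. A minimal sketch, assuming illustrative event ID values and hypothetical names (`struct domain`, `domain_alloc()`, `MBM_IDX()`); it models the idea and is not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Event IDs for illustration; the numeric values are assumptions. */
enum event_id { EVT_OCCUP = 1, EVT_MBM_TOTAL, EVT_MBM_LOCAL, EVT_NUM };

#define NUM_MBM_EVENTS	(EVT_MBM_LOCAL - EVT_MBM_TOTAL + 1)
#define MBM_IDX(evt)	((evt) - EVT_MBM_TOTAL)
#define for_each_mbm(evt) \
	for ((evt) = EVT_MBM_TOTAL; (evt) <= EVT_MBM_LOCAL; (evt)++)

struct domain {
	uint64_t *mbm_states[NUM_MBM_EVENTS];	/* one array per MBM event */
};

/*
 * Allocate per-RMID state for every enabled MBM event. The caller
 * passes a zero-initialized domain and an enabled[] array indexed by
 * event ID. On failure, free everything allocated and return -1.
 */
static int domain_alloc(struct domain *d, const bool *enabled, size_t num_rmid)
{
	int evt;

	assert(d && enabled);
	for_each_mbm(evt) {
		if (!enabled[evt])
			continue;
		d->mbm_states[MBM_IDX(evt)] = calloc(num_rmid, sizeof(uint64_t));
		if (!d->mbm_states[MBM_IDX(evt)])
			goto cleanup;
	}
	return 0;
cleanup:
	for (int i = 0; i < NUM_MBM_EVENTS; i++) {
		free(d->mbm_states[i]);	/* free(NULL) is a no-op */
		d->mbm_states[i] = NULL;
	}
	return -1;
}
```

Adding another MBM event in this model means extending the enum range; the allocation loop itself does not change.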
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 15 +++++---
include/linux/resctrl_types.h | 3 ++
arch/x86/kernel/cpu/resctrl/internal.h | 6 ++--
arch/x86/kernel/cpu/resctrl/core.c | 38 ++++++++++----------
arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++++++--------
fs/resctrl/monitor.c | 13 ++++---
fs/resctrl/rdtgroup.c | 48 ++++++++++++--------------
7 files changed, 84 insertions(+), 72 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 3c5d111aae65..cef9b0ed984c 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -161,8 +161,7 @@ struct rdt_ctrl_domain {
* @hdr: common header for different domain types
* @ci: cache info for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
- * @mbm_total: saved state for MBM total bandwidth
- * @mbm_local: saved state for MBM local bandwidth
+ * @mbm_states: saved state for each QOS MBM event
* @mbm_over: worker to periodically read MBM h/w counters
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
@@ -172,8 +171,7 @@ struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
struct cacheinfo *ci;
unsigned long *rmid_busy_llc;
- struct mbm_state *mbm_total;
- struct mbm_state *mbm_local;
+ struct mbm_state *mbm_states[QOS_NUM_MBM_EVENTS];
struct delayed_work mbm_over;
struct delayed_work cqm_limbo;
int mbm_work_cpu;
@@ -376,6 +374,15 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
+static inline bool resctrl_is_mbm_event(int e)
+{
+ return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
+ e <= QOS_L3_MBM_LOCAL_EVENT_ID);
+}
+
+#define for_each_mbm_event(evt) \
+ for (evt = QOS_L3_MBM_TOTAL_EVENT_ID; evt <= QOS_L3_MBM_LOCAL_EVENT_ID; evt++)
+
/**
* resctrl_arch_mon_event_config_write() - Write the config for an event.
* @config_info: struct resctrl_mon_config_info describing the resource, domain
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index a25fb9c4070d..5ef14a24008c 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -47,4 +47,7 @@ enum resctrl_event_id {
QOS_NUM_EVENTS,
};
+#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
+#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
+
#endif /* __LINUX_RESCTRL_TYPES_H */
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5e3c41b36437..02b535c828f3 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -54,15 +54,13 @@ struct rdt_hw_ctrl_domain {
* struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
* a resource for a monitor function
* @d_resctrl: Properties exposed to the resctrl file system
- * @arch_mbm_total: arch private state for MBM total bandwidth
- * @arch_mbm_local: arch private state for MBM local bandwidth
+ * @arch_mbm_states: arch private state for each MBM event
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
struct rdt_hw_mon_domain {
struct rdt_mon_domain d_resctrl;
- struct arch_mbm_state *arch_mbm_total;
- struct arch_mbm_state *arch_mbm_local;
+ struct arch_mbm_state *arch_mbm_states[QOS_NUM_MBM_EVENTS];
};
static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 819bc7a09327..e5c91d21e8f7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -364,8 +364,8 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
- kfree(hw_dom->arch_mbm_total);
- kfree(hw_dom->arch_mbm_local);
+ for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++)
+ kfree(hw_dom->arch_mbm_states[i]);
kfree(hw_dom);
}
@@ -399,25 +399,27 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
*/
static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
- size_t tsize;
-
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
- tsize = sizeof(*hw_dom->arch_mbm_total);
- hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
- if (!hw_dom->arch_mbm_total)
- return -ENOMEM;
- }
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
- tsize = sizeof(*hw_dom->arch_mbm_local);
- hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
- if (!hw_dom->arch_mbm_local) {
- kfree(hw_dom->arch_mbm_total);
- hw_dom->arch_mbm_total = NULL;
- return -ENOMEM;
- }
+ size_t tsize = sizeof(struct arch_mbm_state);
+ enum resctrl_event_id evt;
+ int idx;
+
+ for_each_mbm_event(evt) {
+ if (!resctrl_is_mon_event_enabled(evt))
+ continue;
+ idx = MBM_EVENT_IDX(evt);
+ hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
+ if (!hw_dom->arch_mbm_states[idx])
+ goto cleanup;
}
return 0;
+cleanup:
+ while (--idx >= 0) {
+ kfree(hw_dom->arch_mbm_states[idx]);
+ hw_dom->arch_mbm_states[idx] = NULL;
+ }
+
+ return -ENOMEM;
}
static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index fda579251dba..bf7fde07846b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,18 +160,21 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
u32 rmid,
enum resctrl_event_id eventid)
{
+ struct arch_mbm_state *state;
+
switch (eventid) {
- case QOS_L3_OCCUP_EVENT_ID:
- return NULL;
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &hw_dom->arch_mbm_total[rmid];
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &hw_dom->arch_mbm_local[rmid];
default:
/* Never expect to get here */
WARN_ON_ONCE(1);
+ fallthrough;
+ case QOS_L3_OCCUP_EVENT_ID:
return NULL;
+ case QOS_L3_MBM_TOTAL_EVENT_ID:
+ case QOS_L3_MBM_LOCAL_EVENT_ID:
+ state = hw_dom->arch_mbm_states[MBM_EVENT_IDX(eventid)];
}
+
+ return state ? &state[rmid] : NULL;
}
void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
@@ -200,14 +203,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
-
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- memset(hw_dom->arch_mbm_total, 0,
- sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
-
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- memset(hw_dom->arch_mbm_local, 0,
- sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
+ enum resctrl_event_id evt;
+ int idx;
+
+ for_each_mbm_event(evt) {
+ idx = MBM_EVENT_IDX(evt);
+ if (!hw_dom->arch_mbm_states[idx])
+ continue;
+ memset(hw_dom->arch_mbm_states[idx], 0,
+ sizeof(struct arch_mbm_state) * r->num_rmid);
+ }
}
static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 7de4e219dba3..ef33970166af 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -346,15 +346,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ struct mbm_state *states;
- switch (evtid) {
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &d->mbm_total[idx];
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &d->mbm_local[idx];
- default:
+ if (!resctrl_is_mbm_event(evtid))
return NULL;
- }
+
+ states = d->mbm_states[MBM_EVENT_IDX(evtid)];
+
+ return states ? &states[idx] : NULL;
}
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 4a092c305255..c06752dfcb7c 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -127,12 +127,6 @@ static bool resctrl_is_mbm_enabled(void)
resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID));
}
-static bool resctrl_is_mbm_event(int e)
-{
- return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
- e <= QOS_L3_MBM_LOCAL_EVENT_ID);
-}
-
/*
* Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap
* of free CLOSIDs.
@@ -4019,8 +4013,10 @@ static void rdtgroup_setup_default(void)
static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
- kfree(d->mbm_total);
- kfree(d->mbm_local);
+ for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++) {
+ kfree(d->mbm_states[i]);
+ d->mbm_states[i] = NULL;
+ }
}
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
@@ -4080,32 +4076,34 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
- size_t tsize;
+ size_t tsize = sizeof(struct mbm_state);
+ enum resctrl_event_id evt;
+ int idx;
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
- tsize = sizeof(*d->mbm_total);
- d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
- if (!d->mbm_total) {
- bitmap_free(d->rmid_busy_llc);
- return -ENOMEM;
- }
- }
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
- tsize = sizeof(*d->mbm_local);
- d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
- if (!d->mbm_local) {
- bitmap_free(d->rmid_busy_llc);
- kfree(d->mbm_total);
- return -ENOMEM;
- }
+
+ for_each_mbm_event(evt) {
+ if (!resctrl_is_mon_event_enabled(evt))
+ continue;
+ idx = MBM_EVENT_IDX(evt);
+ d->mbm_states[idx] = kcalloc(idx_limit, tsize, GFP_KERNEL);
+ if (!d->mbm_states[idx])
+ goto cleanup;
}
return 0;
+cleanup:
+ bitmap_free(d->rmid_busy_llc);
+ while (--idx >= 0) {
+ kfree(d->mbm_states[idx]);
+ d->mbm_states[idx] = NULL;
+ }
+
+ return -ENOMEM;
}
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
--
2.48.1
* [PATCH v4 03/31] fs/resctrl: Clean up rdtgroup_mba_mbps_event_{show,write}()
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
2025-04-29 0:33 ` [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable Tony Luck
2025-04-29 0:33 ` [PATCH v4 02/31] x86,fs/resctrl: Prepare for more monitor events Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:31 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 04/31] fs/resctrl: Change how and when events are initialized Tony Luck
` (27 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
These routines hard-code the two legacy MBM events.
Change them to allow for other MBM events in the future.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 4 ++++
fs/resctrl/ctrlmondata.c | 39 +++++++++------------------------------
fs/resctrl/monitor.c | 16 ++++++++++++++++
3 files changed, 29 insertions(+), 30 deletions(-)
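A rough user-space illustration of the table-driven lookup this patch moves to. This is a minimal sketch, not the kernel code: every demo_* name is hypothetical, demo_events[] only stands in for mon_event_all[], and -1 stands in for -EINVAL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for enum resctrl_event_id. */
enum demo_event_id {
	DEMO_L3_OCCUP = 1,
	DEMO_L3_MBM_TOTAL = 2,
	DEMO_L3_MBM_LOCAL = 3,
	DEMO_NUM_EVENTS,
};

/* Hypothetical stand-in for struct mon_evt (only the fields used here). */
struct demo_mon_evt {
	const char *name;
	int enabled;
};

/* Slot 0 is left empty on purpose, so lookups must tolerate a NULL name. */
static struct demo_mon_evt demo_events[DEMO_NUM_EVENTS] = {
	[DEMO_L3_OCCUP]     = { .name = "llc_occupancy",   .enabled = 1 },
	[DEMO_L3_MBM_TOTAL] = { .name = "mbm_total_bytes", .enabled = 1 },
	[DEMO_L3_MBM_LOCAL] = { .name = "mbm_local_bytes", .enabled = 0 },
};

/* Return the id of the event called "name", or -1 if there is none. */
static int demo_event_by_name(const char *name)
{
	for (int evt = 0; evt < DEMO_NUM_EVENTS; evt++)
		if (demo_events[evt].name && !strcmp(name, demo_events[evt].name))
			return evt;
	return -1;
}

/* Return the event name, or "unknown" for out-of-range or unnamed ids. */
static const char *demo_event_name(int evt)
{
	if (evt >= 0 && evt < DEMO_NUM_EVENTS && demo_events[evt].name)
		return demo_events[evt].name;
	return "unknown";
}

static int demo_is_mbm_event(int evt)
{
	return evt >= DEMO_L3_MBM_TOTAL && evt <= DEMO_L3_MBM_LOCAL;
}

/* Same three checks as the write path: known name, enabled, MBM event. */
static int demo_select_mbps_event(const char *name)
{
	int evt = demo_event_by_name(name);

	if (evt < 0 || !demo_events[evt].enabled || !demo_is_mbm_event(evt))
		return -1;
	return evt;
}
```

With this table, demo_select_mbps_event("mbm_total_bytes") picks the total event, while "mbm_local_bytes" (disabled here) and "llc_occupancy" (not an MBM event) are both rejected. Adding a new event only needs a new table entry.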
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index ff89a0ca130e..6029b3285dd3 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -393,6 +393,10 @@ bool closid_allocated(unsigned int closid);
int resctrl_find_cleanest_closid(void);
+enum resctrl_event_id resctrl_get_mon_event_by_name(char *name);
+
+char *resctrl_mon_event_name(enum resctrl_event_id evt);
+
#ifdef CONFIG_RESCTRL_FS_PSEUDO_LOCK
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index b17b60114afd..53388281ff7d 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -472,26 +472,17 @@ ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
}
rdt_last_cmd_clear();
- if (!strcmp(buf, "mbm_local_bytes")) {
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
- else
- ret = -EINVAL;
- } else if (!strcmp(buf, "mbm_total_bytes")) {
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
- else
- ret = -EINVAL;
- } else {
+ ret = resctrl_get_mon_event_by_name(buf);
+ if (ret < 0 || !resctrl_is_mon_event_enabled(ret) || !resctrl_is_mbm_event(ret)) {
+ rdt_last_cmd_printf("Unsupported event id '%s'\n", buf);
ret = -EINVAL;
+ } else {
+ rdtgrp->mba_mbps_event = ret;
}
- if (ret)
- rdt_last_cmd_printf("Unsupported event id '%s'\n", buf);
-
rdtgroup_kn_unlock(of->kn);
- return ret ?: nbytes;
+ return ret < 0 ? ret : nbytes;
}
int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
@@ -502,22 +493,10 @@ int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
rdtgrp = rdtgroup_kn_lock_live(of->kn);
- if (rdtgrp) {
- switch (rdtgrp->mba_mbps_event) {
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- seq_puts(s, "mbm_local_bytes\n");
- break;
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- seq_puts(s, "mbm_total_bytes\n");
- break;
- default:
- pr_warn_once("Bad event %d\n", rdtgrp->mba_mbps_event);
- ret = -EINVAL;
- break;
- }
- } else {
+ if (rdtgrp)
+ seq_printf(s, "%s\n", resctrl_mon_event_name(rdtgrp->mba_mbps_event));
+ else
ret = -ENOENT;
- }
rdtgroup_kn_unlock(of->kn);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index ef33970166af..625cd328c790 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -869,6 +869,22 @@ bool resctrl_is_mon_event_enabled(enum resctrl_event_id evtid)
return evtid < QOS_NUM_EVENTS && mon_event_all[evtid].enabled;
}
+enum resctrl_event_id resctrl_get_mon_event_by_name(char *name)
+{
+ enum resctrl_event_id evt;
+
+ for (evt = 0; evt < QOS_NUM_EVENTS; evt++)
+ if (mon_event_all[evt].name && !strcmp(name, mon_event_all[evt].name))
+ return evt;
+
+ return -EINVAL;
+}
+
+char *resctrl_mon_event_name(enum resctrl_event_id evt)
+{
+ return evt < QOS_NUM_EVENTS && mon_event_all[evt].name ? mon_event_all[evt].name : "unknown";
+}
+
/*
* Initialize the event list for the resource.
*
--
2.48.1
* [PATCH v4 04/31] fs/resctrl: Change how and when events are initialized
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (2 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 03/31] fs/resctrl: Clean up rdtgroup_mba_mbps_event_{show,write}() Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:31 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
` (26 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Existing code assumes that all monitor events are associated with
the RDT_RESOURCE_L3 resource, and that all event enumeration is
complete during early resctrl initialization. Neither of these
assumptions remains true for new events.
Each resource must include a list of enabled events that is used
to add appropriately named files when creating mon_data directories
and to form the contents of the "info/{resource}_MON/mon_features" file.
Move the building of enabled event lists for each resource from
resctrl_mon_resource_init() to rdt_get_tree() to delay it until
mount of the resctrl file system.
Add a new field to struct mon_evt to record which resource each
event is associated with so that events are added to the correct
resource event list.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 4 ++++
fs/resctrl/monitor.c | 33 ++++++++++++++++++++++-----------
fs/resctrl/rdtgroup.c | 2 ++
3 files changed, 28 insertions(+), 11 deletions(-)
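A user-space sketch of the "build the per-resource lists once, at first mount" idea, for illustration only. All demo_* names are hypothetical, and a hand-rolled singly linked list stands in for the kernel's list_head and list_add_tail().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource ids; DEMO_RES_OTHER models a future non-L3 resource. */
enum demo_res_id { DEMO_RES_L3, DEMO_RES_OTHER, DEMO_NUM_RES };

struct demo_evt {
	const char *name;
	enum demo_res_id rid;	/* which resource owns this event */
	bool enabled;
	struct demo_evt *next;	/* link in the owning resource's list */
};

struct demo_resource {
	struct demo_evt *head, *tail;
	int count;
};

static struct demo_resource demo_resources[DEMO_NUM_RES];

static struct demo_evt demo_evts[] = {
	{ .name = "llc_occupancy",   .rid = DEMO_RES_L3,    .enabled = true  },
	{ .name = "mbm_total_bytes", .rid = DEMO_RES_L3,    .enabled = false },
	{ .name = "mbm_local_bytes", .rid = DEMO_RES_L3,    .enabled = true  },
	{ .name = "demo_telemetry",  .rid = DEMO_RES_OTHER, .enabled = true  },
};

/* Safe to call on every "mount"; only the first call builds the lists. */
static void demo_init_mon_events(void)
{
	static bool only_once;

	if (only_once)
		return;
	only_once = true;

	for (size_t i = 0; i < sizeof(demo_evts) / sizeof(demo_evts[0]); i++) {
		struct demo_evt *e = &demo_evts[i];
		struct demo_resource *r;

		if (!e->enabled)
			continue;

		/* Append to the list of the resource named by e->rid. */
		r = &demo_resources[e->rid];
		if (r->tail)
			r->tail->next = e;
		else
			r->head = e;
		r->tail = e;
		r->count++;
	}
}
```

Calling demo_init_mon_events() twice leaves two events on the L3 list and one on the other resource's list. Without the only_once guard, the second call would append every enabled event again.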
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 6029b3285dd3..b69170760316 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -68,6 +68,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
/**
* struct mon_evt - Entry in the event list of a resource
* @evtid: event id
+ * @rid: index of the resource for this event
* @name: name of the event
* @configurable: true if the event is configurable
* @enabled: true if the event is enabled
@@ -75,6 +76,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
*/
struct mon_evt {
enum resctrl_event_id evtid;
+ enum resctrl_res_level rid;
char *name;
bool configurable;
bool enabled;
@@ -397,6 +399,8 @@ enum resctrl_event_id resctrl_get_mon_event_by_name(char *name);
char *resctrl_mon_event_name(enum resctrl_event_id evt);
+void resctrl_init_mon_events(void);
+
#ifdef CONFIG_RESCTRL_FS_PSEUDO_LOCK
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 625cd328c790..a5a523f73249 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -845,14 +845,17 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
[QOS_L3_OCCUP_EVENT_ID] = {
.name = "llc_occupancy",
.evtid = QOS_L3_OCCUP_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
},
[QOS_L3_MBM_TOTAL_EVENT_ID] = {
.name = "mbm_total_bytes",
.evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
},
[QOS_L3_MBM_LOCAL_EVENT_ID] = {
.name = "mbm_local_bytes",
.evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
},
};
@@ -886,21 +889,31 @@ char *resctrl_mon_event_name(enum resctrl_event_id evt)
}
/*
- * Initialize the event list for the resource.
+ * Initialize the event list for all mon_capable resources.
*
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
+ * Called on each mount of the resctrl file system when all
+ * events have been enumerated. Only needs to build the per-resource
+ * event lists once.
*/
-static void l3_mon_evt_init(struct rdt_resource *r)
+void resctrl_init_mon_events(void)
{
enum resctrl_event_id evt;
+ struct rdt_resource *r;
+ static bool only_once;
+
+ if (only_once)
+ return;
+ only_once = true;
- INIT_LIST_HEAD(&r->evt_list);
+ for_each_mon_capable_rdt_resource(r)
+ INIT_LIST_HEAD(&r->evt_list);
- for (evt = 0; evt < QOS_NUM_EVENTS; evt++)
- if (mon_event_all[evt].enabled)
- list_add_tail(&mon_event_all[evt].list, &r->evt_list);
+ for (evt = 0; evt < QOS_NUM_EVENTS; evt++) {
+ if (!mon_event_all[evt].enabled)
+ continue;
+ r = resctrl_arch_get_resource(mon_event_all[evt].rid);
+ list_add_tail(&mon_event_all[evt].list, &r->evt_list);
+ }
}
/**
@@ -927,8 +940,6 @@ int resctrl_mon_resource_init(void)
if (ret)
return ret;
- l3_mon_evt_init(r);
-
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index c06752dfcb7c..e66dc041be5f 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2591,6 +2591,8 @@ static int rdt_get_tree(struct fs_context *fc)
goto out;
}
+ resctrl_init_mon_events();
+
ret = rdtgroup_setup_root(ctx);
if (ret)
goto out;
--
2.48.1
* [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (3 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 04/31] fs/resctrl: Change how and when events are initialized Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:32 ` Reinette Chatre
2025-05-10 9:58 ` Chen, Yu C
2025-04-29 0:33 ` [PATCH v4 06/31] x86/resctrl: Fake OOBMSM interface Tony Luck
` (25 subsequent siblings)
30 siblings, 2 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Intel RMID-based telemetry events are counted by each CPU core
and then aggregated by one or more per-socket microcontrollers.
Enumeration support is provided by the Intel PMT subsystem.
N.B. Patches for the Intel PMT system are still in progress.
They will define an INTEL_PMT_DISCOVERY Kconfig symbol that
will be one of the dependencies. This is commented out for
now. Final version will include this dependency.
arch/x86 selects this option based on:
X86_64: Counter registers are in MMIO space. There is no readq()
function on 32-bit. Emulation is possible with readl(), but there
are races. Running 32-bit kernels on systems that support this
feature seems pointless.
CPU_SUP_INTEL: It is an Intel specific feature.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/Kconfig | 1 +
drivers/platform/x86/intel/pmt/Kconfig | 7 +++++++
2 files changed, 8 insertions(+)
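To illustrate the race mentioned for 32-bit emulation, here is a small user-space model of a 64-bit counter read as two 32-bit halves. This is a sketch under assumptions: the demo_* names are hypothetical, a plain variable stands in for the MMIO register, and the "tick" argument simulates the device incrementing the counter between the two reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit counter that the "device" may update at any time. */
static uint64_t demo_counter;

static uint32_t demo_read_lo(void)
{
	return (uint32_t)demo_counter;
}

static uint32_t demo_read_hi(void)
{
	return (uint32_t)(demo_counter >> 32);
}

/*
 * Emulated 64-bit read: low half first, then high half.
 * If "tick" is non-zero the counter advances between the two reads,
 * which is what a concurrently updating device can do.
 */
static uint64_t demo_emulated_readq(int tick)
{
	uint32_t lo, hi;

	lo = demo_read_lo();
	if (tick)
		demo_counter++;
	hi = demo_read_hi();

	return ((uint64_t)hi << 32) | lo;
}
```

Starting from 0xFFFFFFFF, a read with an update in the middle returns 0x1FFFFFFFF. The counter never held that value: it went from 0xFFFFFFFF straight to 0x100000000. A single 64-bit access avoids this kind of torn read.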
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5a09acf41c8e..19107fdb4264 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -508,6 +508,7 @@ config X86_CPU_RESCTRL
bool "x86 CPU resource control support"
depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
depends on MISC_FILESYSTEMS
+ select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL)
select ARCH_HAS_CPU_RESCTRL
select RESCTRL_FS
select RESCTRL_FS_PSEUDO_LOCK
diff --git a/drivers/platform/x86/intel/pmt/Kconfig b/drivers/platform/x86/intel/pmt/Kconfig
index e916fc966221..3a8ce39d1004 100644
--- a/drivers/platform/x86/intel/pmt/Kconfig
+++ b/drivers/platform/x86/intel/pmt/Kconfig
@@ -38,3 +38,10 @@ config INTEL_PMT_CRASHLOG
To compile this driver as a module, choose M here: the module
will be called intel_pmt_crashlog.
+
+config INTEL_AET_RESCTRL
+ depends on INTEL_PMT_TELEMETRY # && INTEL_PMT_DISCOVERY
+ bool
+ help
+ Architecture config should "select" this option to enable
+ support for RMID telemetry events in the resctrl file system.
--
2.48.1
* [PATCH v4 06/31] x86/resctrl: Fake OOBMSM interface
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (4 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-04-30 23:02 ` Luck, Tony
2025-05-08 3:33 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 07/31] x86,fs/resctrl: Improve domain type checking Tony Luck
` (24 subsequent siblings)
30 siblings, 2 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The real version is coming soon. This fake is here so that the
remaining parts will build and run, assuming a two-socket system
that supports RDT monitoring. The only missing part is that the
event counters just report fixed values.
This is just for ease of testing and RFC discussion.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
.../cpu/resctrl/fake_intel_aet_features.h | 73 ++++++++++++++
.../cpu/resctrl/fake_intel_aet_features.c | 95 +++++++++++++++++++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
3 files changed, 169 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
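The size arithmetic for the fake MMIO space can be checked in user space. The sketch below follows the layout described in the comment in fake_intel_aet_features.c (four energy regions followed by two perf regions, each with three status qwords after the counters). The demo_* names are hypothetical and uint64_t replaces the kernel's "long".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_RMIDS		576
#define DEMO_ENERGY_EVENTS	2
#define DEMO_PERF_EVENTS	7
#define DEMO_STATUS_QWORDS	3

/* One region = one counter per (RMID, event) pair plus the status qwords. */
#define DEMO_ENERGY_QWORDS	(DEMO_NUM_RMIDS * DEMO_ENERGY_EVENTS + DEMO_STATUS_QWORDS)
#define DEMO_PERF_QWORDS	(DEMO_NUM_RMIDS * DEMO_PERF_EVENTS + DEMO_STATUS_QWORDS)
#define DEMO_TOTAL_QWORDS	(4 * DEMO_ENERGY_QWORDS + 2 * DEMO_PERF_QWORDS)

static uint64_t demo_pg[DEMO_TOTAL_QWORDS];

/* Give every qword a distinct value with bit 63 (the "valid" bit) set. */
static void demo_fill(void)
{
	for (size_t i = 0; i < DEMO_TOTAL_QWORDS; i++)
		demo_pg[i] = (UINT64_C(1) << 63) + i;
}

/* Qword offset of energy region n (0..3). */
static size_t demo_energy_offset(unsigned int n)
{
	return (size_t)n * DEMO_ENERGY_QWORDS;
}

/* Qword offset of perf region n (0..1); perf regions follow all energy regions. */
static size_t demo_perf_offset(unsigned int n)
{
	return 4 * (size_t)DEMO_ENERGY_QWORDS + (size_t)n * DEMO_PERF_QWORDS;
}
```

With these numbers each energy region is 1155 qwords (9240 bytes), each perf region is 4035 qwords (32280 bytes), and the whole backing array is 12690 qwords.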
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
new file mode 100644
index 000000000000..c835c4108abc
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Bits stolen from OOBMSM VSEC discovery code */
+
+enum pmt_feature_id {
+ FEATURE_INVALID = 0x0,
+ FEATURE_PER_CORE_PERF_TELEM = 0x1,
+ FEATURE_PER_CORE_ENV_TELEM = 0x2,
+ FEATURE_PER_RMID_PERF_TELEM = 0x3,
+ FEATURE_ACCEL_TELEM = 0x4,
+ FEATURE_UNCORE_TELEM = 0x5,
+ FEATURE_CRASH_LOG = 0x6,
+ FEATURE_PETE_LOG = 0x7,
+ FEATURE_TPMI_CTRL = 0x8,
+ FEATURE_RESERVED = 0x9,
+ FEATURE_TRACING = 0xA,
+ FEATURE_PER_RMID_ENERGY_TELEM = 0xB,
+ FEATURE_MAX = 0xB,
+};
+
+/**
+ * struct oobmsm_plat_info - Platform information for a device instance
+ * @cdie_mask: Mask of all compute dies in the partition
+ * @package_id: CPU Package id
+ * @partition: Package partition id when multiple VSEC PCI devices per package
+ * @segment: PCI segment ID
+ * @bus_number: PCI bus number
+ * @device_number: PCI device number
+ * @function_number: PCI function number
+ *
+ * Structure to store platform data for a OOBMSM device instance.
+ */
+struct oobmsm_plat_info {
+ u16 cdie_mask;
+ u8 package_id;
+ u8 partition;
+ u8 segment;
+ u8 bus_number;
+ u8 device_number;
+ u8 function_number;
+};
+
+enum oobmsm_supplier_type {
+ OOBMSM_SUP_PLAT_INFO,
+ OOBMSM_SUP_DISC_INFO,
+ OOBMSM_SUP_S3M_SIMICS,
+ OOBMSM_SUP_TYPE_MAX
+};
+
+struct oobmsm_mapping_supplier {
+ struct device *supplier_dev[OOBMSM_SUP_TYPE_MAX];
+ struct oobmsm_plat_info plat_info;
+ unsigned long features;
+};
+
+struct telemetry_region {
+ struct oobmsm_plat_info plat_info;
+ void __iomem *addr;
+ size_t size;
+ u32 guid;
+ u32 num_rmids;
+};
+
+struct pmt_feature_group {
+ enum pmt_feature_id id;
+ int count;
+ struct kref kref;
+ struct telemetry_region regions[];
+};
+
+struct pmt_feature_group *intel_pmt_get_regions_by_feature(enum pmt_feature_id id);
+
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group);
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
new file mode 100644
index 000000000000..22b7c02a538c
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/cleanup.h>
+#include <linux/minmax.h>
+#include <linux/slab.h>
+#include "fake_intel_aet_features.h"
+#include <linux/intel_vsec.h>
+#include <linux/resctrl.h>
+
+#include "internal.h"
+
+/*
+ * Amount of memory for each fake MMIO space
+ * Magic numbers here match values for XML ID 0x26696143 and 0x26557651
+ * 576: Number of RMIDs
+ * 2: Energy events in 0x26696143
+ * 7: Perf events in 0x26557651
+ * 3: Qwords for status counters after the event counters
+ * 8: Bytes for each counter
+ */
+
+#define ENERGY_QWORDS ((576 * 2) + 3)
+#define ENERGY_SIZE (ENERGY_QWORDS * 8)
+#define PERF_QWORDS ((576 * 7) + 3)
+#define PERF_SIZE (PERF_QWORDS * 8)
+
+static long pg[4 * ENERGY_QWORDS + 2 * PERF_QWORDS];
+
+/*
+ * Fill the fake MMIO space with all different values,
+ * all with BIT(63) set to indicate valid entries.
+ */
+static int __init fill(void)
+{
+ u64 val = 0;
+
+ for (int i = 0; i < sizeof(pg); i += sizeof(val)) {
+ pg[i / sizeof(val)] = BIT_ULL(63) + val;
+ val++;
+ }
+ return 0;
+}
+device_initcall(fill);
+
+#define PKG_REGION(_entry, _guid, _addr, _size, _pkg, _num_rmids) \
+ [_entry] = { .guid = _guid, .addr = (void __iomem *)_addr, \
+ .num_rmids = _num_rmids, \
+ .size = _size, .plat_info = { .package_id = _pkg }}
+
+/*
+ * Set up a fake return for call to:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
+ * Pretend there are two aggregators on each of the sockets to test
+ * the code that sums over multiple aggregators.
+ */
+static struct pmt_feature_group fake_energy = {
+ .count = 4,
+ .regions = {
+ PKG_REGION(0, 0x26696143, &pg[0 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
+ PKG_REGION(1, 0x26696143, &pg[1 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
+ PKG_REGION(2, 0x26696143, &pg[2 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64),
+ PKG_REGION(3, 0x26696143, &pg[3 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64)
+ }
+};
+
+/*
+ * Fake return for:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
+ */
+static struct pmt_feature_group fake_perf = {
+ .count = 2,
+ .regions = {
+ PKG_REGION(0, 0x26557651, &pg[4 * ENERGY_QWORDS + 0 * PERF_QWORDS], PERF_SIZE, 0, 576),
+ PKG_REGION(1, 0x26557651, &pg[4 * ENERGY_QWORDS + 1 * PERF_QWORDS], PERF_SIZE, 1, 576)
+ }
+};
+
+struct pmt_feature_group *
+intel_pmt_get_regions_by_feature(enum pmt_feature_id id)
+{
+ switch (id) {
+ case FEATURE_PER_RMID_ENERGY_TELEM:
+ return &fake_energy;
+ case FEATURE_PER_RMID_PERF_TELEM:
+ return &fake_perf;
+ default:
+ return ERR_PTR(-ENOENT);
+ }
+}
+
+/*
+ * Nothing needed for the "put" function.
+ */
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group)
+{
+}
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index d8a04b195da2..28ae1c88b2ac 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_INTEL_AET_RESCTRL) += fake_intel_aet_features.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
# To allow define_trace.h's recursive include:
--
2.48.1
* [PATCH v4 07/31] x86,fs/resctrl: Improve domain type checking
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (5 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 06/31] x86/resctrl: Fake OOBMSM interface Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:36 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 08/31] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
` (23 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The rdt_domain_hdr structure is used in both control and monitor
domain structures to provide common methods for operations such as
adding a CPU to a domain, removing a CPU from a domain, and accessing
the mask of all CPUs in a domain.
The "type" field provides a simple check whether a domain is a
control or monitor domain so that programming errors operating
on domains will be quickly caught.
To prepare for additional domain types that depend on the rdt_resource
to which they are connected, add the resource id to the header
and check it in addition to the type.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 9 +++++++++
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++----
fs/resctrl/ctrlmondata.c | 2 +-
3 files changed, 16 insertions(+), 5 deletions(-)
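A minimal user-space sketch of the combined check, for illustration only. The demo_* names are hypothetical and the kernel's WARN_ON_ONCE() is omitted. As with check_domain_header(), a return of true means the header does NOT match and the caller should bail out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for resctrl_domain_type and resctrl_res_level. */
enum demo_domain_type { DEMO_CTRL_DOMAIN, DEMO_MON_DOMAIN };
enum demo_res_level { DEMO_RES_L3, DEMO_RES_MBA };

/* Only the header fields that take part in the check. */
struct demo_domain_hdr {
	int id;
	enum demo_domain_type type;
	enum demo_res_level rid;
};

/* Returns true on a mismatch in either the domain type or the resource id. */
static bool demo_check_domain_header(const struct demo_domain_hdr *hdr,
				     enum demo_domain_type type,
				     enum demo_res_level rid)
{
	return hdr->type != type || hdr->rid != rid;
}
```

A monitor domain header created for L3 passes only when the caller asks for (DEMO_MON_DOMAIN, DEMO_RES_L3). Asking for a control domain, or for a monitor domain of another resource, is reported as a mismatch, which is the extra protection the resource id adds.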
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index cef9b0ed984c..e700f58b5af5 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -131,15 +131,24 @@ enum resctrl_domain_type {
* @list: all instances of this resource
* @id: unique id for this instance
* @type: type of this instance
+ * @rid: index of resource for this domain
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
enum resctrl_domain_type type;
+ enum resctrl_res_level rid;
struct cpumask cpu_mask;
};
+static inline bool check_domain_header(struct rdt_domain_hdr *hdr,
+ enum resctrl_domain_type type,
+ enum resctrl_res_level rid)
+{
+ return !!WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
+}
+
/**
* struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
* @hdr: common header for different domain types
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e5c91d21e8f7..bdd4d08a3912 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -456,7 +456,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (check_domain_header(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -473,6 +473,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_CTRL_DOMAIN;
+ d->hdr.rid = r->rid;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
rdt_domain_reconfigure_cdp(r);
@@ -511,7 +512,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
@@ -526,6 +527,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = r->rid;
d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!d->ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -581,7 +583,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (check_domain_header(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -627,7 +629,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 53388281ff7d..3cbacfe52430 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -616,7 +616,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
* the resource to find the domain with "domid".
*/
hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+ if (!hdr || check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid)) {
ret = -ENOENT;
goto out;
}
--
2.48.1
* [PATCH v4 08/31] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (6 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 07/31] x86,fs/resctrl: Improve domain type checking Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:37 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 09/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
` (22 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
To prepare for additional types of monitoring domains, move all the
L3 specific initialization into a helper function.
Rename several functions to mark that they are specific to the L3 path.
arch_mon_domain_online -> arch_l3_mon_domain_online
mon_domain_free -> free_l3_mon_domain
domain_setup_mon_state -> domain_setup_l3_mon_state
resctrl_online_mon_domain() is going to share some code with new
resources, so it keeps the same name but gains a check for
RDT_RESOURCE_L3.
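As a rough illustration of the shape this gives domain_add_cpu_mon(), here is a minimal user-space sketch of dispatching on the resource id to a type-specific helper. Every name prefixed with "sketch_" is a hypothetical stand-in, not a kernel symbol, and the -1 return only models the WARN_ON_ONCE() path in the patch.

```c
/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum sketch_res_id { SKETCH_RES_L3, SKETCH_RES_OTHER };

struct sketch_resource {
	enum sketch_res_id rid;
	int l3_domains_created;	/* counts calls to the L3 helper */
};

/* All L3-specific setup is kept in one helper, in the spirit of setup_l3_mon_domain(). */
static void sketch_setup_l3_mon_domain(struct sketch_resource *r)
{
	r->l3_domains_created++;
}

/*
 * Generic entry point. Returns 0 when a helper handled the resource and
 * -1 for a resource type with no helper (the patch warns once there).
 */
static int sketch_domain_add_cpu_mon(struct sketch_resource *r)
{
	switch (r->rid) {
	case SKETCH_RES_L3:
		sketch_setup_l3_mon_domain(r);
		return 0;
	default:
		return -1;
	}
}
```

A new monitoring domain type would then add its own case and helper without touching the L3 path.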
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 69 +++++++++++++++-----------
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
fs/resctrl/rdtgroup.c | 11 ++--
4 files changed, 50 insertions(+), 34 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 02b535c828f3..b563406b4996 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -122,7 +122,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
extern struct rdt_hw_resource rdt_resources_all[];
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index bdd4d08a3912..d48cdc85a86d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -362,7 +362,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
+static void free_l3_mon_domain(struct rdt_hw_mon_domain *hw_dom)
{
for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++)
kfree(hw_dom->arch_mbm_states[i]);
@@ -493,33 +493,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
}
}
-static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct list_head *add_pos = NULL;
struct rdt_hw_mon_domain *hw_dom;
- struct rdt_domain_hdr *hdr;
struct rdt_mon_domain *d;
int err;
- lockdep_assert_held(&domain_list_lock);
-
- if (id < 0) {
- pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->mon_scope, r->name);
- return;
- }
-
- hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
- if (hdr) {
- if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
-
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- return;
- }
-
hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
if (!hw_dom)
return;
@@ -531,15 +510,15 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!d->ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
- mon_domain_free(hw_dom);
+ free_l3_mon_domain(hw_dom);
return;
}
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- arch_mon_domain_online(r, d);
+ arch_l3_mon_domain_online(r, d);
if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- mon_domain_free(hw_dom);
+ free_l3_mon_domain(hw_dom);
return;
}
@@ -549,7 +528,41 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ free_l3_mon_domain(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ return;
+ }
+
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ setup_l3_mon_domain(cpu, id, r, add_pos);
+ break;
+ default:
+ WARN_ON_ONCE(1);
}
}
@@ -640,7 +653,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
resctrl_offline_mon_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ free_l3_mon_domain(hw_dom);
return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index bf7fde07846b..d1f659dd6109 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -271,7 +271,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
* must adjust RMID counter numbers based on SNC node. See
* logical_rmid_to_physical_rmid() for code that does this.
*/
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
{
if (snc_nodes_per_l3_cache > 1)
msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index e66dc041be5f..a0d2be84832c 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4063,7 +4063,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
}
/**
- * domain_setup_mon_state() - Initialise domain monitoring structures.
+ * domain_setup_l3_mon_state() - Initialise domain monitoring structures.
* @r: The resource for the newly online domain.
* @d: The newly online domain.
*
@@ -4075,7 +4075,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
*
* Returns 0 for success, or -ENOMEM.
*/
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
+static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(struct mbm_state);
@@ -4126,11 +4126,14 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
{
- int err;
+ int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
- err = domain_setup_mon_state(r, d);
+ if (r->rid != RDT_RESOURCE_L3)
+ goto out_unlock;
+
+ err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
--
2.48.1
* [PATCH v4 09/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (7 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 08/31] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:37 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 10/31] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
` (21 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The RDT_RESOURCE_L3 resource carries a lot of state in the domain
structures which needs to be dealt with when a domain is taken offline
by removing the last CPU in the domain.
Refactor so all the L3 processing is separated from general actions of
clearing the CPU bit in the mask and removing directories from mon_data.
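The ordering described above (generic mask update first, type-specific teardown only when the last CPU leaves) can be sketched in user space as follows. This is an illustration only: the "sketch_" names are hypothetical, a 64-bit integer stands in for struct cpumask, and a flag stands in for freeing the L3 state.

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_rid { SKETCH_L3, SKETCH_OTHER };

struct sketch_domain {
	uint64_t cpu_mask;	/* one bit per CPU; assumes cpu < 64 */
	bool l3_state_freed;	/* stand-in for the L3 teardown work */
};

/* Returns true when the last CPU left and the domain went offline. */
static bool sketch_remove_cpu(struct sketch_domain *d, enum sketch_rid rid,
			      unsigned int cpu)
{
	/* Generic step: clear the CPU bit regardless of domain type. */
	d->cpu_mask &= ~(UINT64_C(1) << cpu);
	if (d->cpu_mask)
		return false;

	/* Type-specific teardown runs only for the last CPU. */
	switch (rid) {
	case SKETCH_L3:
		d->l3_state_freed = true;
		break;
	default:
		break;
	}
	return true;
}
```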
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 14 ++++++++------
fs/resctrl/rdtgroup.c | 5 ++++-
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d48cdc85a86d..525439029865 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -645,17 +645,19 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- hw_dom = resctrl_to_arch_mon_dom(d);
+ cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
+ return;
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
free_l3_mon_domain(hw_dom);
-
- return;
+ break;
}
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a0d2be84832c..a65f3e16bdab 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4042,6 +4042,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
if (resctrl_mounted && resctrl_arch_mon_capable())
rmdir_mondata_subdir_allrdtgrp(r, d);
+ if (r->rid != RDT_RESOURCE_L3)
+ goto done;
+
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4058,7 +4061,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
}
domain_destroy_mon_state(d);
-
+done:
mutex_unlock(&rdtgroup_mutex);
}
--
2.48.1
* [PATCH v4 10/31] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (8 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 09/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:38 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 11/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
` (20 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Functions that don't need the internal details of the rdt_mon_domain
can operate on just the rdt_domain_hdr.
Where container_of() is used to find the surrounding domain structure,
add sanity checks that hdr has the expected type.
Simplify code that uses "d->hdr." to "hdr->" where possible.
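For readers less familiar with the pattern, a minimal user-space sketch of "check the header type, then container_of()" follows. The structures, the sketch_container_of() macro and sketch_to_mon_domain() are hypothetical simplifications; the real check_domain_header() also compares the resource id and warns on a mismatch.

```c
#include <stddef.h>

enum sketch_domain_type { SKETCH_CTRL_DOMAIN, SKETCH_MON_DOMAIN };

/* Common header embedded in every domain structure. */
struct sketch_hdr {
	enum sketch_domain_type type;
	int id;
};

struct sketch_l3_mon_domain {
	struct sketch_hdr hdr;
	int cache_id;
};

/* Simplified container_of(): step back from the member to its parent. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Only convert when the header says it really is a monitor domain. */
static struct sketch_l3_mon_domain *sketch_to_mon_domain(struct sketch_hdr *hdr)
{
	if (!hdr || hdr->type != SKETCH_MON_DOMAIN)
		return NULL;
	return sketch_container_of(hdr, struct sketch_l3_mon_domain, hdr);
}
```

Functions that only need hdr->id or hdr->cpu_mask can skip the conversion entirely, which is the simplification this patch applies.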
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 4 +-
arch/x86/kernel/cpu/resctrl/core.c | 19 +++----
fs/resctrl/rdtgroup.c | 82 +++++++++++++++++++++---------
3 files changed, 68 insertions(+), 37 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index e700f58b5af5..bb55c449adc4 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -444,9 +444,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 525439029865..9c78828ae32f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -460,7 +460,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ cpumask_set_cpu(cpu, &hdr->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
return;
@@ -524,7 +524,7 @@ static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct
list_add_tail_rcu(&d->hdr.list, add_pos);
- err = resctrl_online_mon_domain(r, d);
+ err = resctrl_online_mon_domain(r, &d->hdr);
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
@@ -537,7 +537,6 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
int id = get_domain_id_from_scope(cpu, r->mon_scope);
struct list_head *add_pos = NULL;
struct rdt_domain_hdr *hdr;
- struct rdt_mon_domain *d;
lockdep_assert_held(&domain_list_lock);
@@ -549,11 +548,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
if (hdr) {
- if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
-
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ cpumask_set_cpu(cpu, &hdr->cpu_mask);
return;
}
@@ -603,9 +598,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
hw_dom = resctrl_to_arch_ctrl_dom(d);
cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
+ if (cpumask_empty(&hdr->cpu_mask)) {
resctrl_offline_ctrl_domain(r, d);
- list_del_rcu(&d->hdr.list);
+ list_del_rcu(&hdr->list);
synchronize_rcu();
/*
@@ -653,8 +648,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
case RDT_RESOURCE_L3:
d = container_of(hdr, struct rdt_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
- resctrl_offline_mon_domain(r, d);
- list_del_rcu(&d->hdr.list);
+ resctrl_offline_mon_domain(r, hdr);
+ list_del_rcu(&hdr->list);
synchronize_rcu();
free_l3_mon_domain(hw_dom);
break;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a65f3e16bdab..0ec87db799b4 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3022,7 +3022,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
* when last domain being summed is removed.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
char subname[32];
@@ -3030,9 +3030,17 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
char name[32];
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
- if (snc_mode)
- sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ if (snc_mode) {
+ struct rdt_mon_domain *d;
+
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
+ } else {
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ }
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
@@ -3042,11 +3050,12 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
-static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp,
bool do_sum)
{
struct rmid_read rr = {0};
+ struct rdt_mon_domain *d;
struct mon_data *priv;
struct mon_evt *mevt;
int ret, domid;
@@ -3054,8 +3063,16 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
if (WARN_ON(list_empty(&r->evt_list)))
return -EPERM;
- list_for_each_entry(mevt, &r->evt_list, list) {
+ if (r->rid == RDT_RESOURCE_L3) {
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return -EINVAL;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
domid = do_sum ? d->ci->id : d->hdr.id;
+ } else {
+ domid = hdr->id;
+ }
+
+ list_for_each_entry(mevt, &r->evt_list, list) {
priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
if (WARN_ON_ONCE(!priv))
return -EINVAL;
@@ -3064,18 +3081,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
if (ret)
return ret;
- if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
+ if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
+ mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt->evtid, true);
}
return 0;
}
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_mon_domain *d,
+ struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
+ struct rdt_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
@@ -3083,7 +3101,14 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
lockdep_assert_held(&rdtgroup_mutex);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
+ if (snc_mode) {
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return -EINVAL;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ } else {
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ }
kn = kernfs_find_and_get(parent_kn, name);
if (kn) {
/*
@@ -3099,13 +3124,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
ret = rdtgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
- ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
+ ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
if (ret)
goto out_destroy;
}
if (snc_mode) {
- sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
if (IS_ERR(ckn)) {
ret = -EINVAL;
@@ -3116,7 +3141,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (ret)
goto out_destroy;
- ret = mon_add_all_files(ckn, d, r, prgrp, false);
+ ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
if (ret)
goto out_destroy;
}
@@ -3134,7 +3159,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3142,12 +3167,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
parent_kn = crgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, crgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
}
}
}
@@ -3156,14 +3181,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_mon_domain *dom;
+ struct rdt_domain_hdr *hdr;
int ret;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(dom, &r->mon_domains, hdr.list) {
- ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
+ list_for_each_entry(hdr, &r->mon_domains, list) {
+ ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
if (ret)
return ret;
}
@@ -4031,8 +4056,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
mutex_unlock(&rdtgroup_mutex);
}
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
+ struct rdt_mon_domain *d;
+
mutex_lock(&rdtgroup_mutex);
/*
@@ -4040,11 +4067,15 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d);
+ rmdir_mondata_subdir_allrdtgrp(r, hdr);
if (r->rid != RDT_RESOURCE_L3)
goto done;
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ goto done;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4127,8 +4158,9 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
return err;
}
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
+ struct rdt_mon_domain *d;
int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
@@ -4136,6 +4168,10 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
if (r->rid != RDT_RESOURCE_L3)
goto out_unlock;
+ if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ goto out_unlock;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4156,7 +4192,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
* If resctrl is mounted, add per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- mkdir_mondata_subdir_allrdtgrp(r, d);
+ mkdir_mondata_subdir_allrdtgrp(r, hdr);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
--
2.48.1
* [PATCH v4 11/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (9 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 10/31] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:39 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
` (19 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
These structures have generic names, but are only used for L3 monitor
events.
Rename:
rdt_mon_domain -> rdt_l3_mon_domain
rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 12 ++++----
arch/x86/kernel/cpu/resctrl/internal.h | 12 ++++----
fs/resctrl/internal.h | 12 ++++----
arch/x86/kernel/cpu/resctrl/core.c | 14 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 18 ++++++------
fs/resctrl/ctrlmondata.c | 6 ++--
fs/resctrl/monitor.c | 28 +++++++++---------
fs/resctrl/rdtgroup.c | 40 +++++++++++++-------------
8 files changed, 71 insertions(+), 71 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index bb55c449adc4..cd7881313d4e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -166,7 +166,7 @@ struct rdt_ctrl_domain {
};
/**
- * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
+ * struct rdt_l3_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @ci: cache info for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
@@ -176,7 +176,7 @@ struct rdt_ctrl_domain {
* @mbm_work_cpu: worker CPU for MBM h/w counters
* @cqm_work_cpu: worker CPU for CQM h/w counters
*/
-struct rdt_mon_domain {
+struct rdt_l3_mon_domain {
struct rdt_domain_hdr hdr;
struct cacheinfo *ci;
unsigned long *rmid_busy_llc;
@@ -335,7 +335,7 @@ struct resctrl_cpu_defaults {
struct resctrl_mon_config_info {
struct rdt_resource *r;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u32 evtid;
u32 mon_config;
};
@@ -475,7 +475,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *arch_mon_ctx);
@@ -522,7 +522,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid,
enum resctrl_event_id eventid);
@@ -535,7 +535,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/**
* resctrl_arch_reset_all_ctrls() - Reset the control for each CLOSID to its
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index b563406b4996..83b20e6b25d7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -51,15 +51,15 @@ struct rdt_hw_ctrl_domain {
};
/**
- * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ * struct rdt_hw_l3_mon_domain - Arch private attributes of a set of CPUs that share
* a resource for a monitor function
* @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_states: arch private state for each MBM event
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_mon_domain {
- struct rdt_mon_domain d_resctrl;
+struct rdt_hw_l3_mon_domain {
+ struct rdt_l3_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_states[QOS_NUM_MBM_EVENTS];
};
@@ -68,9 +68,9 @@ static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctr
return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
}
-static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3_mon_domain *r)
{
- return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
}
/**
@@ -122,7 +122,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
extern struct rdt_hw_resource rdt_resources_all[];
-void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index b69170760316..759768e2a2a8 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -129,7 +129,7 @@ struct mon_data {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
enum resctrl_event_id evtid;
bool first;
struct cacheinfo *ci;
@@ -365,12 +365,12 @@ void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first);
int resctrl_mon_resource_init(void);
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
@@ -378,14 +378,14 @@ void mbm_handle_overflow(struct work_struct *work);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_mon_domain *d);
+bool has_busy_rmid(struct rdt_l3_mon_domain *d);
-void __check_limbo(struct rdt_mon_domain *d, bool force_free);
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free);
void resctrl_file_fflags_init(const char *config, unsigned long fflags);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9c78828ae32f..01843dd0b8b7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -362,7 +362,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void free_l3_mon_domain(struct rdt_hw_mon_domain *hw_dom)
+static void free_l3_mon_domain(struct rdt_hw_l3_mon_domain *hw_dom)
{
for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++)
kfree(hw_dom->arch_mbm_states[i]);
@@ -397,7 +397,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
{
size_t tsize = sizeof(struct arch_mbm_state);
enum resctrl_event_id evt;
@@ -495,8 +495,8 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- struct rdt_hw_mon_domain *hw_dom;
- struct rdt_mon_domain *d;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
int err;
hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
@@ -618,9 +618,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_mon_domain *hw_dom;
+ struct rdt_hw_l3_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
lockdep_assert_held(&domain_list_lock);
@@ -646,7 +646,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
switch (r->rid) {
case RDT_RESOURCE_L3:
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index d1f659dd6109..8d8ec86929fa 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -108,7 +108,7 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
*
* In RMID sharing mode there are fewer "logical RMID" values available
* to accumulate data ("physical RMIDs" are divided evenly between SNC
- * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
+ * nodes that share an L3 cache). Linux creates an rdt_l3_mon_domain for
* each SNC node.
*
* The value loaded into IA32_PQR_ASSOC is the "logical RMID".
@@ -156,7 +156,7 @@ static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_l3_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -177,11 +177,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
return state ? &state[rmid] : NULL;
}
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid,
enum resctrl_event_id eventid)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
u32 prmid;
@@ -200,9 +200,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
enum resctrl_event_id evt;
int idx;
@@ -223,11 +223,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
@@ -271,7 +271,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
* must adjust RMID counter numbers based on SNC node. See
* logical_rmid_to_physical_rmid() for code that does this.
*/
-void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
if (snc_nodes_per_l3_cache > 1)
msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 3cbacfe52430..8c0f6d229130 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -526,7 +526,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
}
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first)
{
int cpu;
@@ -569,7 +569,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct kernfs_open_file *of = m->private;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
@@ -620,7 +620,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
}
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index a5a523f73249..19cba29452b7 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -130,7 +130,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_mon_domain *d, bool force_free)
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -188,7 +188,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}
-bool has_busy_rmid(struct rdt_mon_domain *d)
+bool has_busy_rmid(struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -289,7 +289,7 @@ int alloc_rmid(u32 closid)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u32 idx;
lockdep_assert_held(&rdtgroup_mutex);
@@ -342,7 +342,7 @@ void free_rmid(u32 closid, u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}
-static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -359,7 +359,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
int cpu = smp_processor_id();
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct mbm_state *m;
int err, ret;
u64 tval = 0;
@@ -532,7 +532,7 @@ static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu,
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
}
-static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id evtid)
{
struct rmid_read rr = {0};
@@ -627,7 +627,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *
resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
-static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid)
{
/*
@@ -648,12 +648,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
- d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -676,7 +676,7 @@ void cqm_handle_limbo(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -693,7 +693,7 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
@@ -708,7 +708,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- d = container_of(work, struct rdt_mon_domain, mbm_over.work);
+ d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
@@ -742,7 +742,7 @@ void mbm_handle_overflow(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 0ec87db799b4..d2f9361694b6 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1613,7 +1613,7 @@ static void mondata_config_read(struct resctrl_mon_config_info *mon_info)
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct resctrl_mon_config_info mon_info;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
cpus_read_lock();
@@ -1661,7 +1661,7 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
}
static void mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_mon_domain *d, u32 evtid, u32 val)
+ struct rdt_l3_mon_domain *d, u32 evtid, u32 val)
{
struct resctrl_mon_config_info mon_info = {0};
@@ -1703,7 +1703,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -2577,7 +2577,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
struct rdt_resource *r;
int ret;
@@ -3031,11 +3031,11 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
if (snc_mode) {
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
} else {
@@ -3055,7 +3055,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
bool do_sum)
{
struct rmid_read rr = {0};
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct mon_data *priv;
struct mon_evt *mevt;
int ret, domid;
@@ -3066,7 +3066,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
if (r->rid == RDT_RESOURCE_L3) {
if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
domid = do_sum ? d->ci->id : d->hdr.id;
} else {
domid = hdr->id;
@@ -3093,7 +3093,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
@@ -3104,7 +3104,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (snc_mode) {
if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
} else {
sprintf(name, "mon_%s_%02d", r->name, hdr->id);
@@ -4037,7 +4037,7 @@ static void rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}
-static void domain_destroy_mon_state(struct rdt_mon_domain *d)
+static void domain_destroy_mon_state(struct rdt_l3_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++) {
@@ -4058,7 +4058,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
mutex_lock(&rdtgroup_mutex);
@@ -4075,7 +4075,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4109,7 +4109,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
*
* Returns 0 for success, or -ENOMEM.
*/
-static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
+static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(struct mbm_state);
@@ -4160,7 +4160,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
@@ -4171,7 +4171,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
return err;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4218,10 +4218,10 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
}
}
-static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
- struct rdt_resource *r)
+static struct rdt_l3_mon_domain *get_mon_domain_from_cpu(int cpu,
+ struct rdt_resource *r)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
lockdep_assert_cpus_held();
@@ -4237,7 +4237,7 @@ static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
void resctrl_offline_cpu(unsigned int cpu)
{
struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct rdtgroup *rdtgrp;
mutex_lock(&rdtgroup_mutex);
--
2.48.1
* [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (10 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 11/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 3:54 ` Reinette Chatre
2025-05-13 3:19 ` Chen, Yu C
2025-04-29 0:33 ` [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats Tony Luck
` (18 subsequent siblings)
30 siblings, 2 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Resctrl file system code was built with the assumption that monitor
events can only be read from a CPU in the cpumask_t set for each
domain. This was true for x86 events accessed with an MSR interface,
but may not be true for other access methods such as MMIO.
Add a flag to each instance of struct mon_evt that can be set by
architecture code to indicate there is no restriction on which
CPU can read the event counter.
Change struct mon_data and struct rmid_read to have a pointer to
the struct mon_evt instead of the event id.
Add an extra argument to resctrl_enable_mon_event() so architecture
code can indicate which events can be read on any CPU when enabling
the event.
Bypass all the smp_call*() code for events that can be read on any CPU
and call mon_event_count() directly from mon_event_read().
For such events, skip the checks in __mon_event_count() that the read
is being done from a CPU in the correct domain or cache scope.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
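Illustration only, not part of the commit: a minimal userspace sketch of
the dispatch decision described in the changelog above. It assumes a
64-bit mask as a stand-in for cpumask_t. The names struct evt_model,
pick_read_path() and reader_cpu_ok() are hypothetical and do not appear
in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mon_evt; only the flag used here. */
struct evt_model {
	bool any_cpu;	/* true if the counter can be read from any CPU */
};

enum read_path {
	READ_DIRECT,		/* call the counting function on the current CPU */
	READ_CROSS_CALL,	/* run it on a CPU taken from the domain's mask */
};

/* Models the early exit added to mon_event_read(). */
static enum read_path pick_read_path(const struct evt_model *evt)
{
	return evt->any_cpu ? READ_DIRECT : READ_CROSS_CALL;
}

/*
 * Models the relaxed check in __mon_event_count(): the reading CPU must
 * be in the domain mask unless the event is readable from any CPU.
 * Assumption: a uint64_t bitmask replaces cpumask_t, so cpu is 0..63.
 */
static int reader_cpu_ok(const struct evt_model *evt, uint64_t domain_mask, int cpu)
{
	assert(cpu >= 0 && cpu < 64);

	if (!evt->any_cpu && !(domain_mask & (UINT64_C(1) << cpu)))
		return -EINVAL;
	return 0;
}
```

In this model an MSR-style event (any_cpu false) read from a CPU outside
the mask gets -EINVAL, while an MMIO-style event (any_cpu true) is
accepted from any CPU.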
include/linux/resctrl.h | 2 +-
fs/resctrl/internal.h | 12 +++++++-----
arch/x86/kernel/cpu/resctrl/core.c | 6 +++---
fs/resctrl/ctrlmondata.c | 24 +++++++++++++++---------
fs/resctrl/monitor.c | 26 ++++++++++++++------------
fs/resctrl/rdtgroup.c | 6 +++---
6 files changed, 43 insertions(+), 33 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index cd7881313d4e..4af5e8d30193 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -379,7 +379,7 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
-void resctrl_enable_mon_event(enum resctrl_event_id evtid);
+void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 759768e2a2a8..d8aa69b42c74 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -72,6 +72,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @name: name of the event
* @configurable: true if the event is configurable
* @enabled: true if the event is enabled
+ * @any_cpu: true if the event can be read from any CPU
* @list: entry in &rdt_resource->evt_list
*/
struct mon_evt {
@@ -80,6 +81,7 @@ struct mon_evt {
char *name;
bool configurable;
bool enabled;
+ bool any_cpu;
struct list_head list;
};
@@ -89,7 +91,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
* @rid: Resource id associated with the event file.
- * @evtid: Event id associated with the event file.
+ * @evt: Event associated with the event file.
* @sum: Set when event must be summed across multiple
* domains.
* @domid: When @sum is zero this is the domain to which
@@ -103,7 +105,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
struct mon_data {
struct list_head list;
enum resctrl_res_level rid;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
int domid;
bool sum;
};
@@ -116,7 +118,7 @@ struct mon_data {
* @r: Resource describing the properties of the event being read.
* @d: Domain that the counter should be read from. If NULL then sum all
* domains in @r sharing L3 @ci.id
- * @evtid: Which monitor event to read.
+ * @evt: Which monitor event to read.
* @first: Initialize MBM counter when true.
* @ci: Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
* @err: Error encountered when reading counter.
@@ -130,7 +132,7 @@ struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
struct rdt_l3_mon_domain *d;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
bool first;
struct cacheinfo *ci;
int err;
@@ -366,7 +368,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first);
+ cpumask_t *cpumask, struct mon_evt *evt, int first);
int resctrl_mon_resource_init(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 01843dd0b8b7..58bc218070e2 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -877,15 +877,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
ret = true;
}
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 8c0f6d229130..7a2957b9c13e 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -527,7 +527,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first)
+ cpumask_t *cpumask, struct mon_evt *evt, int first)
{
int cpu;
@@ -538,16 +538,21 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* Setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
- rr->evtid = evtid;
+ rr->evt = evt;
rr->r = r;
rr->d = d;
rr->first = first;
- rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid);
+ rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evt->evtid);
if (IS_ERR(rr->arch_mon_ctx)) {
rr->err = -EINVAL;
return;
}
+ if (evt->any_cpu) {
+ mon_event_count(rr);
+ goto done;
+ }
+
cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
/*
@@ -560,8 +565,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
smp_call_function_any(cpumask, mon_event_count, rr, 1);
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
-
- resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
+done:
+ resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
@@ -570,7 +575,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
struct rdt_l3_mon_domain *d;
- u32 resid, evtid, domid;
+ struct mon_evt *evt;
+ u32 resid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
struct mon_data *md;
@@ -590,7 +596,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
resid = md->rid;
domid = md->domid;
- evtid = md->evtid;
+ evt = md->evt;
r = resctrl_arch_get_resource(resid);
if (md->sum) {
@@ -604,7 +610,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
if (d->ci->id == domid) {
rr.ci = d->ci;
mon_event_read(&rr, r, NULL, rdtgrp,
- &d->ci->shared_cpu_map, evtid, false);
+ &d->ci->shared_cpu_map, evt, false);
goto checkresult;
}
}
@@ -621,7 +627,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
goto out;
}
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
+ mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evt, false);
}
checkresult:
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 19cba29452b7..e903d3c076ee 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -365,19 +365,19 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
u64 tval = 0;
if (rr->first) {
- resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evt->evtid);
+ m = get_mbm_state(rr->d, closid, rmid, rr->evt->evtid);
if (m)
memset(m, 0, sizeof(struct mbm_state));
return 0;
}
if (rr->d) {
- /* Reading a single domain, must be on a CPU in that domain. */
- if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
+ /* Reading a single domain, must usually be on a CPU in that domain. */
+ if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
return -EINVAL;
rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -387,7 +387,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
}
/* Summing domains that share a cache, must be on a CPU for that cache. */
- if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
+ if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
return -EINVAL;
/*
@@ -402,7 +402,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
if (d->ci->id != rr->ci->id)
continue;
err = resctrl_arch_rmid_read(rr->r, d, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
ret = 0;
@@ -432,7 +432,7 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
u64 cur_bw, bytes, cur_bytes;
struct mbm_state *m;
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ m = get_mbm_state(rr->d, closid, rmid, rr->evt->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -603,12 +603,13 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_m
static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id evtid)
{
+ struct mon_evt *evt = &mon_event_all[evtid];
struct rmid_read rr = {0};
rr.r = r;
rr.d = d;
- rr.evtid = evtid;
- rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ rr.evt = evt;
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, evtid);
if (IS_ERR(rr.arch_mon_ctx)) {
pr_warn_ratelimited("Failed to allocate monitor context: %ld",
PTR_ERR(rr.arch_mon_ctx));
@@ -624,7 +625,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domai
if (is_mba_sc(NULL))
mbm_bw_count(closid, rmid, &rr);
- resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
+ resctrl_arch_mon_ctx_free(rr.r, evtid, rr.arch_mon_ctx);
}
static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
@@ -859,12 +860,13 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
},
};
-void resctrl_enable_mon_event(enum resctrl_event_id evtid)
+void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
{
if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
return;
mon_event_all[evtid].enabled = true;
+ mon_event_all[evtid].any_cpu = any_cpu;
}
bool resctrl_is_mon_event_enabled(enum resctrl_event_id evtid)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index d2f9361694b6..d16bb05fafe8 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2897,7 +2897,7 @@ static struct mon_data *mon_get_kn_priv(int rid, int domid,
list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
if (priv->rid == rid && priv->domid == domid &&
- priv->sum == do_sum && priv->evtid == mevt->evtid)
+ priv->sum == do_sum && priv->evt == mevt)
return priv;
}
@@ -2908,7 +2908,7 @@ static struct mon_data *mon_get_kn_priv(int rid, int domid,
priv->rid = rid;
priv->domid = domid;
priv->sum = do_sum;
- priv->evtid = mevt->evtid;
+ priv->evt = mevt;
list_add_tail(&priv->list, &mon_data_kn_priv_list);
return priv;
@@ -3082,7 +3082,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
return ret;
if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt->evtid, true);
+ mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt, true);
}
return 0;
--
2.48.1
* [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (11 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:49 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 14/31] fs/resctrl: Add an architectural hook called for each mount Tony Luck
` (17 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Resctrl was written with the assumption that all monitor events
can be displayed as unsigned decimal integers.
Some telemetry events provide greater precision where architecture code
uses a fixed point format with 18 binary places.
Add a "display_format" field to struct mon_evt which can specify
that the value for the event be displayed as an integer for legacy
events, or as a floating point value with six decimal places converted
from the fixed point format received from architecture code.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
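Illustration only, not part of the commit: a userspace sketch of the
fixed-point to decimal conversion described in the changelog above,
using the same arithmetic as the patch (multiply the 18 fraction bits by
10^6, add one half, shift right by 18). The helper names
u46_18_to_millionths() and print_u46_18() are hypothetical; the kernel
code prints through seq_printf() instead.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_FRAC_BITS	18
#define FRAC_MASK	((UINT64_C(1) << NUM_FRAC_BITS) - 1)

/*
 * Split a U46.18 fixed-point value. Returns the fraction in millionths
 * (rounded to nearest) and stores the integer part in *whole.
 */
static uint64_t u46_18_to_millionths(uint64_t val, uint64_t *whole)
{
	uint64_t frac = val & FRAC_MASK;

	assert(whole);
	*whole = val >> NUM_FRAC_BITS;

	frac *= 1000000;	/* below 2^18 * 10^6, so no u64 overflow */
	frac += UINT64_C(1) << (NUM_FRAC_BITS - 1);	/* add one half */
	frac >>= NUM_FRAC_BITS;	/* truncate: net effect is round to nearest */
	return frac;
}

/* Print in the "integer.six-digits" form used for EVT_FORMAT_U46_18. */
static void print_u46_18(uint64_t val)
{
	uint64_t whole;
	uint64_t frac = u46_18_to_millionths(val, &whole);

	printf("%" PRIu64 ".%06" PRIu64 "\n", whole, frac);
}
```

With this rounding the largest possible fraction (all 18 bits set) comes
out as 999996, so the six-digit field never carries into the integer
part.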
include/linux/resctrl_types.h | 5 +++++
fs/resctrl/internal.h | 2 ++
fs/resctrl/ctrlmondata.c | 24 +++++++++++++++++++++++-
fs/resctrl/monitor.c | 21 ++++++++++++---------
4 files changed, 42 insertions(+), 10 deletions(-)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index 5ef14a24008c..6245034f6c76 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -50,4 +50,9 @@ enum resctrl_event_id {
#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
+/* Event value display formats */
+enum resctrl_event_fmt {
+ EVT_FORMAT_U64,
+ EVT_FORMAT_U46_18,
+};
#endif /* __LINUX_RESCTRL_TYPES_H */
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index d8aa69b42c74..aaa74a17257d 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -73,6 +73,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @configurable: true if the event is configurable
* @enabled: true if the event is enabled
* @any_cpu: true if the event can be read from any CPU
+ * @display_format: format to display value to users
* @list: entry in &rdt_resource->evt_list
*/
struct mon_evt {
@@ -82,6 +83,7 @@ struct mon_evt {
bool configurable;
bool enabled;
bool any_cpu;
+ enum resctrl_event_fmt display_format;
struct list_head list;
};
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 7a2957b9c13e..1544c103446b 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -569,6 +569,28 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
+#define NUM_FRAC_BITS 18
+#define FRAC_MASK GENMASK(NUM_FRAC_BITS - 1, 0)
+
+static void print_event_value(struct seq_file *m, enum resctrl_event_fmt type, u64 val)
+{
+ u64 frac;
+
+ switch (type) {
+ case EVT_FORMAT_U64:
+ seq_printf(m, "%llu\n", val);
+ break;
+ case EVT_FORMAT_U46_18:
+ frac = val & FRAC_MASK;
+ frac = frac * 1000000;
+ /* round to the nearest decimal representation */
+ frac += 1ul << (NUM_FRAC_BITS - 1);
+ frac >>= NUM_FRAC_BITS;
+ seq_printf(m, "%llu.%06llu\n", val >> NUM_FRAC_BITS, frac);
+ break;
+ }
+}
+
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
@@ -637,7 +659,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
else if (rr.err == -EINVAL)
seq_puts(m, "Unavailable\n");
else
- seq_printf(m, "%llu\n", rr.val);
+ print_event_value(m, evt->display_format, rr.val);
out:
rdtgroup_kn_unlock(of->kn);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index e903d3c076ee..be78488a15e5 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -844,19 +844,22 @@ static void dom_data_exit(struct rdt_resource *r)
struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
[QOS_L3_OCCUP_EVENT_ID] = {
- .name = "llc_occupancy",
- .evtid = QOS_L3_OCCUP_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
+ .name = "llc_occupancy",
+ .evtid = QOS_L3_OCCUP_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
+ .display_format = EVT_FORMAT_U64,
},
[QOS_L3_MBM_TOTAL_EVENT_ID] = {
- .name = "mbm_total_bytes",
- .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
+ .name = "mbm_total_bytes",
+ .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
+ .display_format = EVT_FORMAT_U64,
},
[QOS_L3_MBM_LOCAL_EVENT_ID] = {
- .name = "mbm_local_bytes",
- .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
+ .name = "mbm_local_bytes",
+ .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
+ .display_format = EVT_FORMAT_U64,
},
};
--
2.48.1
* [PATCH v4 14/31] fs/resctrl: Add an architectural hook called for each mount
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (12 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:50 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 15/31] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
` (16 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Enumeration of Intel telemetry events is not complete when the
resctrl "late_init" code is executed.
Add a hook at the beginning of the mount code that will be used
to check for telemetry events and initialize if any are found.
The hook is called on every attempted mount, but most actions (such as
enumeration) are expected to be needed only on the first call.
The call is made with no locks held. Architecture code is responsible
for any required locking.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
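Note for readers: the once-only behavior described in the commit message
can be illustrated in userspace C11. This is a minimal sketch, not the
kernel code. The names pre_mount_hook(), enumerate_telemetry() and
enumerations_run are hypothetical, and atomic_compare_exchange_strong()
stands in for the kernel's atomic_cmpxchg().

```c
#include <stdatomic.h>

/* Hypothetical counter so the effect of the guard can be observed. */
static int enumerations_run;

/* Hypothetical stand-in for the work that only the first mount needs. */
static void enumerate_telemetry(void)
{
	enumerations_run++;
}

/* Called on every mount attempt; only the first caller does the work. */
void pre_mount_hook(void)
{
	static atomic_int only_once;
	int expected = 0;

	/* Succeeds for exactly one caller: the one that flips 0 to 1. */
	if (!atomic_compare_exchange_strong(&only_once, &expected, 1))
		return;

	enumerate_telemetry();
}
```

In this sketch a concurrent second caller returns immediately, possibly
before the first caller has finished its work. Whether that matters
depends on what later code assumes about enumeration being complete.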
include/linux/resctrl.h | 6 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 8 ++++++++
fs/resctrl/rdtgroup.c | 2 ++
3 files changed, 16 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 4af5e8d30193..6f424fffa083 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -450,6 +450,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
+/*
+ * Architecture hook called for each attempted file system mount.
+ * No locks are held.
+ */
+void resctrl_arch_pre_mount(void);
+
/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 58bc218070e2..2f3efc4b1816 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -707,6 +707,14 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
return 0;
}
+void resctrl_arch_pre_mount(void)
+{
+ static atomic_t only_once;
+
+ if (atomic_cmpxchg(&only_once, 0, 1))
+ return;
+}
+
enum {
RDT_FLAG_CMT,
RDT_FLAG_MBM_TOTAL,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index d16bb05fafe8..da71057f3ff4 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2581,6 +2581,8 @@ static int rdt_get_tree(struct fs_context *fc)
struct rdt_resource *r;
int ret;
+ resctrl_arch_pre_mount();
+
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
/*
--
2.48.1
* [PATCH v4 15/31] x86/resctrl: Add and initialize rdt_resource for package scope core monitor
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (13 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 14/31] fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:50 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 16/31] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
` (15 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Counts for each Intel telemetry event are periodically sent to one or
more aggregators on each package where accumulated totals are made
available in MMIO registers.
Add a new resource for monitoring these events with code to build
domains at the package granularity.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 11 +++++++++++
2 files changed, 13 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 6f424fffa083..3ae50b947a99 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,6 +53,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_PERF_PKG,
/* Must be the last */
RDT_NUM_RESOURCES,
@@ -250,6 +251,7 @@ enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
RESCTRL_L3_NODE,
+ RESCTRL_PACKAGE,
};
/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 2f3efc4b1816..4d1556707c01 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -99,6 +99,15 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
.schema_fmt = RESCTRL_SCHEMA_RANGE,
},
},
+ [RDT_RESOURCE_PERF_PKG] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .name = "PERF_PKG",
+ .mon_scope = RESCTRL_PACKAGE,
+ .mon_domains = mon_domain_init(RDT_RESOURCE_PERF_PKG),
+ },
+ },
};
u32 resctrl_arch_system_num_rmid_idx(void)
@@ -430,6 +439,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return get_cpu_cacheinfo_id(cpu, scope);
case RESCTRL_L3_NODE:
return cpu_to_node(cpu);
+ case RESCTRL_PACKAGE:
+ return topology_physical_package_id(cpu);
default:
break;
}
--
2.48.1
* [PATCH v4 16/31] x86/resctrl: Add first part of telemetry event enumeration
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (14 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 15/31] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:53 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 17/31] x86/resctrl: Add second " Tony Luck
` (14 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The OOBMSM VSEC discovery driver enumerates many different types
of telemetry resources. Resctrl is only interested in the ones
that are tied to an RMID value in the IA32_PQR_ASSOC MSR.
Make a request for each of the FEATURE_PER_RMID_ENERGY_TELEM and
FEATURE_PER_RMID_PERF_TELEM feature groups and scan the list
of known event groups for matching guid values.
Configuration to follow in subsequent patches.
Hold onto references to any pmt_feature_groups that resctrl
uses until resctrl exit.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 8 ++
arch/x86/kernel/cpu/resctrl/core.c | 5 ++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 113 ++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
4 files changed, 127 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 83b20e6b25d7..571db665eca6 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -167,4 +167,12 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
+#ifdef CONFIG_INTEL_AET_RESCTRL
+bool intel_aet_get_events(void);
+void __exit intel_aet_exit(void);
+#else
+static inline bool intel_aet_get_events(void) { return false; }
+static inline void intel_aet_exit(void) { };
+#endif
+
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4d1556707c01..0103f577e4ca 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -724,6 +724,9 @@ void resctrl_arch_pre_mount(void)
if (atomic_cmpxchg(&only_once, 0, 1))
return;
+
+ if (!intel_aet_get_events())
+ return;
}
enum {
@@ -1076,6 +1079,8 @@ late_initcall(resctrl_arch_late_init);
static void __exit resctrl_arch_exit(void)
{
+ intel_aet_exit();
+
cpuhp_remove_state(rdt_online);
resctrl_exit();
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
new file mode 100644
index 000000000000..dda44baf75ae
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology (RDT)
+ * - Intel Application Energy Telemetry
+ *
+ * Copyright (C) 2025 Intel Corporation
+ *
+ * Author:
+ * Tony Luck <tony.luck@intel.com>
+ */
+
+#define pr_fmt(fmt) "resctrl: " fmt
+
+#include <linux/cleanup.h>
+#include <linux/cpu.h>
+#include <linux/resctrl.h>
+
+/* Temporary - delete from final version */
+#include "fake_intel_aet_features.h"
+
+#include "internal.h"
+
+/**
+ * struct event_group - All information about a group of telemetry events.
+ * Some fields initialized with MMIO layout information
+ * gleaned from the XML files. Others are set from data
+ * retrieved from intel_pmt_get_regions_by_feature().
+ * @pfg: The pmt_feature_group for this event group
+ * @guid: Unique number per XML description file
+ */
+struct event_group {
+ struct pmt_feature_group *pfg;
+ int guid;
+};
+
+/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
+static struct event_group energy_0x26696143 = {
+ .guid = 0x26696143,
+};
+
+/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
+static struct event_group perf_0x26557651 = {
+ .guid = 0x26557651,
+};
+
+static struct event_group *known_event_groups[] = {
+ &energy_0x26696143,
+ &perf_0x26557651,
+};
+
+#define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
+
+static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
+{
+ return false;
+}
+
+DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
+ if (!IS_ERR_OR_NULL(_T)) \
+ intel_pmt_put_feature_group(_T))
+
+static bool get_pmt_feature(enum pmt_feature_id feature)
+{
+ struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
+ struct event_group **peg;
+ bool ret;
+
+ p = intel_pmt_get_regions_by_feature(feature);
+
+ if (IS_ERR_OR_NULL(p))
+ return false;
+
+ for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
+ for (int i = 0; i < p->count; i++) {
+ if ((*peg)->guid == p->regions[i].guid) {
+ ret = configure_events((*peg), p);
+ if (ret) {
+ (*peg)->pfg = no_free_ptr(p);
+ return true;
+ }
+ break;
+ }
+ }
+ }
+
+ return false;
+}
+
+/*
+ * Ask OOBMSM discovery driver for all the RMID based telemetry groups
+ * that it supports.
+ */
+bool intel_aet_get_events(void)
+{
+ bool ret1, ret2;
+
+ ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM);
+ ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM);
+
+ return ret1 || ret2;
+}
+
+void __exit intel_aet_exit(void)
+{
+ struct event_group **peg;
+
+ for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
+ if ((*peg)->pfg) {
+ intel_pmt_put_feature_group((*peg)->pfg);
+ (*peg)->pfg = NULL;
+ }
+ }
+}
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index 28ae1c88b2ac..8b4603cad783 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_INTEL_AET_RESCTRL) += intel_aet.o
obj-$(CONFIG_INTEL_AET_RESCTRL) += fake_intel_aet_features.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
--
2.48.1
* [PATCH v4 17/31] x86/resctrl: Add second part of telemetry event enumeration
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (15 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 16/31] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:54 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 18/31] x86/resctrl: Add third " Tony Luck
` (13 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There may be multiple telemetry aggregators per package, each enumerated
by a telemetry region structure in the feature group.
Scan the array of telemetry region structures and count how many are
in each package in preparation to allocate structures to save the MMIO
addresses for each in a convenient format for use when reading event
counters.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index dda44baf75ae..a0365c3ce982 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -52,6 +52,27 @@ static struct event_group *known_event_groups[] = {
static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
{
+ int *pkgcounts __free(kfree) = NULL;
+ struct telemetry_region *tr;
+ int num_pkgs;
+
+ num_pkgs = topology_max_packages();
+ pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
+ if (!pkgcounts)
+ return false;
+
+ /* Get per-package counts of telemetry_region for this guid */
+ for (int i = 0; i < p->count; i++) {
+ tr = &p->regions[i];
+ if (tr->guid != e->guid)
+ continue;
+ if (tr->plat_info.package_id >= num_pkgs) {
+ pr_warn_once("Bad package %d\n", tr->plat_info.package_id);
+ continue;
+ }
+ pkgcounts[tr->plat_info.package_id]++;
+ }
+
return false;
}
--
2.48.1
* [PATCH v4 18/31] x86/resctrl: Add third part of telemetry event enumeration
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (16 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 17/31] x86/resctrl: Add second " Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:56 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 19/31] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
` (12 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Counters for telemetry events are in MMIO space. Each telemetry_region
structure returned in the pmt_feature_group returned from OOBMSM
contains the base MMIO address for the counters.
Scan all the telemetry_region structures again and gather these
addresses into a more convenient structure with addresses for
each aggregator indexed by package id. Note that there may be
multiple aggregators per package.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
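Note for readers: the count-then-fill scan described in the commit
message can be sketched in userspace C as below. This is an illustration
under stated assumptions, not the patch itself: struct region and struct
pkg_addrs are simplified, hypothetical stand-ins for struct
telemetry_region and struct mmio_info, addresses are plain integers, and
calloc()/malloc() replace the kernel allocators.

```c
#include <stdint.h>
#include <stdlib.h>

struct region {			/* hypothetical stand-in for telemetry_region */
	unsigned int guid;
	unsigned int package_id;
	uintptr_t addr;
};

struct pkg_addrs {		/* same shape as mmio_info: count + flexible array */
	int count;
	uintptr_t addrs[];
};

void free_pkg_addrs(struct pkg_addrs **info, unsigned int num_pkgs)
{
	if (!info)
		return;
	for (unsigned int p = 0; p < num_pkgs; p++)
		free(info[p]);
	free(info);
}

struct pkg_addrs **gather_addrs(const struct region *r, int nregions,
				unsigned int guid, unsigned int num_pkgs)
{
	struct pkg_addrs **info = NULL;
	int *remaining = calloc(num_pkgs, sizeof(*remaining));

	if (!remaining)
		return NULL;

	/* Pass 1: count matching regions per package. */
	for (int i = 0; i < nregions; i++)
		if (r[i].guid == guid && r[i].package_id < num_pkgs)
			remaining[r[i].package_id]++;

	/* Allocate one exactly-sized array per package. */
	info = calloc(num_pkgs, sizeof(*info));
	if (!info)
		goto out;
	for (unsigned int p = 0; p < num_pkgs; p++) {
		info[p] = malloc(sizeof(**info) +
				 remaining[p] * sizeof(uintptr_t));
		if (!info[p]) {
			free_pkg_addrs(info, num_pkgs);
			info = NULL;
			goto out;
		}
		info[p]->count = remaining[p];
	}

	/* Pass 2: store addresses, counting each package's slot down to 0. */
	for (int i = 0; i < nregions; i++) {
		if (r[i].guid != guid || r[i].package_id >= num_pkgs)
			continue;
		info[r[i].package_id]->addrs[--remaining[r[i].package_id]] =
			r[i].addr;
	}
out:
	free(remaining);
	return info;
}
```

Because pass 2 counts the per-package slot index down, addresses land in
the reverse of their discovery order. That is harmless when all
aggregators on a package are summed, but worth knowing if order matters.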
arch/x86/kernel/cpu/resctrl/intel_aet.c | 55 +++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index a0365c3ce982..03839d5c369b 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -20,6 +20,16 @@
#include "internal.h"
+/**
+ * struct mmio_info - Array of MMIO addresses for a package
+ * @count: Number of addresses on this package
+ * @addrs: The MMIO addresses
+ */
+struct mmio_info {
+ int count;
+ void __iomem *addrs[] __counted_by(count);
+};
+
/**
* struct event_group - All information about a group of telemetry events.
* Some fields initialized with MMIO layout information
@@ -27,10 +37,12 @@
* retrieved from intel_pmt_get_regions_by_feature().
* @pfg: The pmt_feature_group for this event group
* @guid: Unique number per XML description file
+ * @pkginfo: Per-package MMIO addresses
*/
struct event_group {
struct pmt_feature_group *pfg;
int guid;
+ struct mmio_info **pkginfo;
};
/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
@@ -50,12 +62,33 @@ static struct event_group *known_event_groups[] = {
#define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
+static void free_mmio_info(struct mmio_info **mmi)
+{
+ int num_pkgs = topology_max_packages();
+
+ if (!mmi)
+ return;
+
+ for (int i = 0; i < num_pkgs; i++)
+ kfree(mmi[i]);
+ kfree(mmi);
+}
+
+DEFINE_FREE(mmio_info, struct mmio_info **, free_mmio_info(_T))
+
static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
{
+ struct mmio_info __free(mmio_info) **pkginfo = NULL;
int *pkgcounts __free(kfree) = NULL;
struct telemetry_region *tr;
+ struct mmio_info *mmi;
int num_pkgs;
+ if (e->pkginfo) {
+ pr_warn("Duplicate telemetry information for guid 0x%x\n", e->guid);
+ return false;
+ }
+
num_pkgs = topology_max_packages();
pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
if (!pkgcounts)
@@ -73,6 +106,27 @@ static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
pkgcounts[tr->plat_info.package_id]++;
}
+ /* Allocate per-package arrays and save MMIO addresses */
+ pkginfo = kcalloc(num_pkgs, sizeof(*pkginfo), GFP_KERNEL);
+ if (!pkginfo)
+ return false;
+ for (int i = 0; i < num_pkgs; i++) {
+ pkginfo[i] = kmalloc(struct_size(pkginfo[i], addrs, pkgcounts[i]), GFP_KERNEL);
+ if (!pkginfo[i])
+ return false;
+ pkginfo[i]->count = pkgcounts[i];
+ }
+
+ /* Save MMIO address(es) for each aggregator in per-package structures */
+ for (int i = 0; i < p->count; i++) {
+ tr = &p->regions[i];
+ if (tr->guid != e->guid || tr->plat_info.package_id >= num_pkgs)
+ continue;
+ mmi = pkginfo[tr->plat_info.package_id];
+ mmi->addrs[--pkgcounts[tr->plat_info.package_id]] = tr->addr;
+ }
+ e->pkginfo = no_free_ptr(pkginfo);
+
return false;
}
@@ -130,5 +184,6 @@ void __exit intel_aet_exit(void)
intel_pmt_put_feature_group((*peg)->pfg);
(*peg)->pfg = NULL;
}
+ free_mmio_info((*peg)->pkginfo);
}
}
--
2.48.1
* [PATCH v4 19/31] x86,fs/resctrl: Fill in details of Clearwater Forest events
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (17 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 18/31] x86/resctrl: Add third " Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:54 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 20/31] x86/resctrl: Check for adequate MMIO space Tony Luck
` (11 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Clearwater Forest supports two energy related telemetry events
and seven perf style events.
Define these events in the file system code and add the events
to the event_group structures.
PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed-point
format. File system code must display them as decimal floating-point values.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
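Note for readers: converting a fixed-point counter to decimal text can be
sketched in userspace C as below. This is only an illustration. It
assumes that EVT_FORMAT_U46_18 means 46 integer bits followed by 18
fraction bits, and it truncates the fraction to six digits; the rounding
actually used by the series is not shown in this message. The names
format_u46_18() and FRAC_BITS are hypothetical.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 18	/* assumed width of the binary fraction */

/* Write val as "<whole>.<six decimal digits>" into buf. */
int format_u46_18(char *buf, size_t len, uint64_t val)
{
	uint64_t whole = val >> FRAC_BITS;
	uint64_t frac = val & ((UINT64_C(1) << FRAC_BITS) - 1);

	/* Scale the binary fraction to millionths, truncating the rest. */
	frac = (frac * 1000000) >> FRAC_BITS;

	return snprintf(buf, len, "%" PRIu64 ".%06" PRIu64, whole, frac);
}
```

For example, the raw value (3 << 18) | (1 << 17) has an integer part of 3
and a fraction of exactly one half, so it formats as "3.500000".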
include/linux/resctrl_types.h | 11 +++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 31 ++++++++++++++
fs/resctrl/monitor.c | 54 +++++++++++++++++++++++++
3 files changed, 96 insertions(+)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index 6245034f6c76..39de5451cff8 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -43,6 +43,17 @@ enum resctrl_event_id {
QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
+ /* Intel Telemetry Events */
+ PMT_EVENT_ENERGY,
+ PMT_EVENT_ACTIVITY,
+ PMT_EVENT_STALLS_LLC_HIT,
+ PMT_EVENT_C1_RES,
+ PMT_EVENT_UNHALTED_CORE_CYCLES,
+ PMT_EVENT_STALLS_LLC_MISS,
+ PMT_EVENT_AUTO_C6_RES,
+ PMT_EVENT_UNHALTED_REF_CYCLES,
+ PMT_EVENT_UOPS_RETIRED,
+
/* Must be the last */
QOS_NUM_EVENTS,
};
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 03839d5c369b..7e4f6a6672d4 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -30,6 +30,18 @@ struct mmio_info {
void __iomem *addrs[] __counted_by(count);
};
+/**
+ * struct pmt_event - Telemetry event.
+ * @evtid: Resctrl event id
+ * @evt_idx: Counter index within each per-RMID block of counters
+ */
+struct pmt_event {
+ enum resctrl_event_id evtid;
+ int evt_idx;
+};
+
+#define EVT(id, idx) { .evtid = id, .evt_idx = idx }
+
/**
* struct event_group - All information about a group of telemetry events.
* Some fields initialized with MMIO layout information
@@ -38,21 +50,40 @@ struct mmio_info {
* @pfg: The pmt_feature_group for this event group
* @guid: Unique number per XML description file
* @pkginfo: Per-package MMIO addresses
+ * @num_events: Number of events in this group
+ * @evts: Array of event descriptors
*/
struct event_group {
struct pmt_feature_group *pfg;
int guid;
struct mmio_info **pkginfo;
+ int num_events;
+ struct pmt_event evts[] __counted_by(num_events);
};
/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
static struct event_group energy_0x26696143 = {
.guid = 0x26696143,
+ .num_events = 2,
+ .evts = {
+ EVT(PMT_EVENT_ENERGY, 0),
+ EVT(PMT_EVENT_ACTIVITY, 1),
+ }
};
/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
static struct event_group perf_0x26557651 = {
.guid = 0x26557651,
+ .num_events = 7,
+ .evts = {
+ EVT(PMT_EVENT_STALLS_LLC_HIT, 0),
+ EVT(PMT_EVENT_C1_RES, 1),
+ EVT(PMT_EVENT_UNHALTED_CORE_CYCLES, 2),
+ EVT(PMT_EVENT_STALLS_LLC_MISS, 3),
+ EVT(PMT_EVENT_AUTO_C6_RES, 4),
+ EVT(PMT_EVENT_UNHALTED_REF_CYCLES, 5),
+ EVT(PMT_EVENT_UOPS_RETIRED, 6),
+ }
};
static struct event_group *known_event_groups[] = {
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index be78488a15e5..f848325591b4 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -861,6 +861,60 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
.rid = RDT_RESOURCE_L3,
.display_format = EVT_FORMAT_U64,
},
+ [PMT_EVENT_ENERGY] = {
+ .name = "core_energy",
+ .evtid = PMT_EVENT_ENERGY,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U46_18,
+ },
+ [PMT_EVENT_ACTIVITY] = {
+ .name = "activity",
+ .evtid = PMT_EVENT_ACTIVITY,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U46_18,
+ },
+ [PMT_EVENT_STALLS_LLC_HIT] = {
+ .name = "stalls_llc_hit",
+ .evtid = PMT_EVENT_STALLS_LLC_HIT,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U64,
+ },
+ [PMT_EVENT_C1_RES] = {
+ .name = "c1_res",
+ .evtid = PMT_EVENT_C1_RES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U64,
+ },
+ [PMT_EVENT_UNHALTED_CORE_CYCLES] = {
+ .name = "unhalted_core_cycles",
+ .evtid = PMT_EVENT_UNHALTED_CORE_CYCLES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U64,
+ },
+ [PMT_EVENT_STALLS_LLC_MISS] = {
+ .name = "stalls_llc_miss",
+ .evtid = PMT_EVENT_STALLS_LLC_MISS,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U64,
+ },
+ [PMT_EVENT_AUTO_C6_RES] = {
+ .name = "c6_res",
+ .evtid = PMT_EVENT_AUTO_C6_RES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U64,
+ },
+ [PMT_EVENT_UNHALTED_REF_CYCLES] = {
+ .name = "unhalted_ref_cycles",
+ .evtid = PMT_EVENT_UNHALTED_REF_CYCLES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U64,
+ },
+ [PMT_EVENT_UOPS_RETIRED] = {
+ .name = "uops_retired",
+ .evtid = PMT_EVENT_UOPS_RETIRED,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ .display_format = EVT_FORMAT_U64,
+ },
};
void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
--
2.48.1
* [PATCH v4 20/31] x86/resctrl: Check for adequate MMIO space
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (18 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 19/31] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:56 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 21/31] x86/resctrl: Add fourth part of telemetry event enumeration Tony Luck
` (10 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The MMIO space for each telemetry aggregator is arranged as an array of
count registers for each event for RMID 0, followed by RMID 1, and so on.
After all event counters there are three status registers. All registers
are 8 bytes each.
The total size of MMIO space as described by the XML files is thus:
(NUM_RMIDS * NUM_COUNTERS + 3) * 8
Add an "mmio_size" field to the event_group structure and a sanity
check that the size reported in the telemetry_region structure obtained
from intel_pmt_get_regions_by_feature() is as large as expected.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
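Note for readers: the size formula in the commit message can be checked
with a few lines of C. This is a sketch for illustration. The helper
names are hypothetical, and 576 is taken from the mmio_size initializers
in this patch on the assumption that it is the number of RMIDs.

```c
#include <stddef.h>

#define NUM_STATUS_REGS	3	/* status registers after all counters */
#define REG_BYTES	8	/* every register is 8 bytes */

/* Total bytes of MMIO space for one aggregator. */
static size_t aggregator_mmio_size(size_t num_rmids, size_t num_counters)
{
	return (num_rmids * num_counters + NUM_STATUS_REGS) * REG_BYTES;
}

/* Byte offset of one event counter for one RMID. */
static size_t counter_offset(size_t rmid, size_t num_counters, size_t evt_idx)
{
	return (rmid * num_counters + evt_idx) * REG_BYTES;
}
```

With these assumptions the energy group needs (576 * 2 + 3) * 8 = 9240
bytes and the perf group needs (576 * 7 + 3) * 8 = 32280 bytes. The last
perf counter (RMID 575, index 6) starts at offset 32248, leaving exactly
24 bytes for the three status registers.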
arch/x86/kernel/cpu/resctrl/intel_aet.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 7e4f6a6672d4..37dd493df250 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -49,6 +49,7 @@ struct pmt_event {
* retrieved from intel_pmt_get_regions_by_feature().
* @pfg: The pmt_feature_group for this event group
* @guid: Unique number per XML description file
+ * @mmio_size: Number of bytes of MMIO registers for this group
* @pkginfo: Per-package MMIO addresses
* @num_events: Number of events in this group
* @evts: Array of event descriptors
@@ -56,6 +57,7 @@ struct pmt_event {
struct event_group {
struct pmt_feature_group *pfg;
int guid;
+ int mmio_size;
struct mmio_info **pkginfo;
int num_events;
struct pmt_event evts[] __counted_by(num_events);
@@ -64,6 +66,7 @@ struct event_group {
/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
static struct event_group energy_0x26696143 = {
.guid = 0x26696143,
+ .mmio_size = (576 * 2 + 3) * 8,
.num_events = 2,
.evts = {
EVT(PMT_EVENT_ENERGY, 0),
@@ -74,6 +77,7 @@ static struct event_group energy_0x26696143 = {
/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
static struct event_group perf_0x26557651 = {
.guid = 0x26557651,
+ .mmio_size = (576 * 7 + 3) * 8,
.num_events = 7,
.evts = {
EVT(PMT_EVENT_STALLS_LLC_HIT, 0),
@@ -134,6 +138,10 @@ static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
pr_warn_once("Bad package %d\n", tr->plat_info.package_id);
continue;
}
+ if (tr->size < e->mmio_size) {
+ pr_warn_once("MMIO space too small for guid 0x%x\n", e->guid);
+ continue;
+ }
pkgcounts[tr->plat_info.package_id]++;
}
@@ -151,7 +159,8 @@ static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
/* Save MMIO address(es) for each aggregator in per-package structures */
for (int i = 0; i < p->count; i++) {
tr = &p->regions[i];
- if (tr->guid != e->guid || tr->plat_info.package_id >= num_pkgs)
+ if (tr->guid != e->guid || tr->plat_info.package_id >= num_pkgs ||
+ tr->size < e->mmio_size)
continue;
mmi = pkginfo[tr->plat_info.package_id];
mmi->addrs[--pkgcounts[tr->plat_info.package_id]] = tr->addr;
--
2.48.1
* [PATCH v4 21/31] x86/resctrl: Add fourth part of telemetry event enumeration
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (19 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 20/31] x86/resctrl: Check for adequate MMIO space Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:56 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 22/31] x86/resctrl: Read core telemetry events Tony Luck
` (9 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
At run time when a user reads an event file the file system code
provides the enum resctrl_event_id for the event.
Create a lookup table indexed by event id to provide the event_group
structure and the event index into MMIO space.
Enable each event and mark it as readable from any CPU.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 37dd493df250..e1cb6bd4788d 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -97,6 +97,16 @@ static struct event_group *known_event_groups[] = {
#define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
+/**
+ * struct evtinfo - lookup table from resctrl_event_id to useful information
+ * @event_group: Pointer to the event_group structure for this event
+ * @idx: Counter index within each per-RMID block of counters
+ */
+static struct evtinfo {
+ struct event_group *event_group;
+ int idx;
+} evtinfo[QOS_NUM_EVENTS];
+
static void free_mmio_info(struct mmio_info **mmi)
{
int num_pkgs = topology_max_packages();
@@ -167,7 +177,16 @@ static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
}
e->pkginfo = no_free_ptr(pkginfo);
- return false;
+ for (int i = 0; i < e->num_events; i++) {
+ enum resctrl_event_id evt;
+
+ evt = e->evts[i].evtid;
+ evtinfo[evt].event_group = e;
+ evtinfo[evt].idx = e->evts[i].evt_idx;
+ resctrl_enable_mon_event(evt, true);
+ }
+
+ return true;
}
DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
--
2.48.1
* [PATCH v4 22/31] x86/resctrl: Read core telemetry events
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (20 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 21/31] x86/resctrl: Add fourth part of telemetry event enumeration Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:57 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 23/31] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
` (8 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The resctrl file system passes requests to read event monitor files to
the architecture resctrl_arch_rmid_read() function to collect values
from hardware counters.
Use the resctrl resource to distinguish calls that read legacy L3 events
from calls that read the new telemetry events (which are attached to
RDT_RESOURCE_PERF_PKG).
There may be multiple devices tracking each package, so scan all of them
and add up all counters.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
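Note for readers: the per-domain read described in the commit message
(check the valid bit, mask off the data bits, sum across aggregators) can
be sketched in userspace C as below. This is an illustration, not the
patch code: ordinary arrays stand in for MMIO regions, sum_event() is a
hypothetical name, and the sketch deliberately accumulates into a local
so that *val is written only when every aggregator reports valid data.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define VALID_BIT	(UINT64_C(1) << 63)
#define DATA_BITS	(VALID_BIT - 1)		/* bits 62:0 */

/*
 * Sum counter number idx across all aggregators of one domain.
 * Returns 0 on success, -EINVAL if any aggregator has no valid data.
 */
int sum_event(const uint64_t *const *aggregators, int naggr, size_t idx,
	      uint64_t *val)
{
	uint64_t total = 0;

	for (int i = 0; i < naggr; i++) {
		uint64_t raw = aggregators[i][idx];

		if (!(raw & VALID_BIT))
			return -EINVAL;
		total += raw & DATA_BITS;
	}

	*val = total;
	return 0;
}
```

Writing the result only on success means a caller never sees a partial
sum. Code that instead adds into *val directly relies on the caller
having zeroed it first.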
arch/x86/kernel/cpu/resctrl/internal.h | 5 ++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 34 +++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 3 +++
3 files changed, 42 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 571db665eca6..dd5fe8a98304 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -170,9 +170,14 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
#ifdef CONFIG_INTEL_AET_RESCTRL
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
+int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void intel_aet_exit(void) { };
+static inline int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val)
+{
+ return -EINVAL;
+}
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index e1cb6bd4788d..0bbf991da981 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -13,6 +13,7 @@
#include <linux/cleanup.h>
#include <linux/cpu.h>
+#include <linux/io.h>
#include <linux/resctrl.h>
/* Temporary - delete from final version */
@@ -246,3 +247,36 @@ void __exit intel_aet_exit(void)
free_mmio_info((*peg)->pkginfo);
}
}
+
+#define VALID_BIT BIT_ULL(63)
+#define DATA_BITS GENMASK_ULL(62, 0)
+
+/*
+ * Read counter for an event on a domain (summing all aggregators
+ * on the domain).
+ */
+int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val)
+{
+ struct evtinfo *info = &evtinfo[evtid];
+ struct mmio_info *mmi;
+ u64 evtcount;
+ int idx;
+
+ idx = rmid * info->event_group->num_events;
+ idx += info->idx;
+ mmi = info->event_group->pkginfo[domid];
+
+ if (idx * sizeof(u64) > info->event_group->mmio_size) {
+ pr_warn_once("MMIO index %d out of range\n", idx);
+ return -EINVAL;
+ }
+
+ for (int i = 0; i < mmi->count; i++) {
+ evtcount = readq(mmi->addrs[i] + idx * sizeof(u64));
+ if (!(evtcount & VALID_BIT))
+ return -EINVAL;
+ *val += evtcount & DATA_BITS;
+ }
+
+ return 0;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 8d8ec86929fa..04214585824b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -237,6 +237,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
resctrl_arch_rmid_read_context_check();
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ return intel_aet_read_event(d->hdr.id, rmid, eventid, val);
+
prmid = logical_rmid_to_physical_rmid(cpu, rmid);
ret = __rmid_read_phys(prmid, eventid, &msr_val);
if (ret)
--
2.48.1
* [PATCH v4 23/31] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (21 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 22/31] x86/resctrl: Read core telemetry events Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:58 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 24/31] fs/resctrl: Add type define for PERF_PKG files Tony Luck
` (7 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The L3 resource has several requirements for domains. There are structures
that hold the 64-bit values of counters, and elements to keep track of
the overflow and limbo threads.
None of these are needed for the PERF_PKG resource. The hardware counters
are wide enough that they do not wrap around for decades.
Define a new rdt_perf_pkg_mon_domain structure which just consists of
the standard rdt_domain_hdr to keep track of domain id and CPU mask.
Change domain_add_cpu_mon(), domain_remove_cpu_mon(),
resctrl_offline_mon_domain(), and resctrl_online_mon_domain() to check
resource type and perform only the operations needed for domains in the
PERF_PKG resource.
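For illustration only: a user-space sketch of a domain type that consists
of nothing but the common header, and of recovering the containing
structure from a header pointer when freeing it. The structure and function
names are simplified stand-ins, and container_of is redefined locally since
the kernel macro is not available outside the kernel.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins; the kernel structures carry more fields. */
struct domain_hdr {
	int id;
	int rid;
};

struct perf_pkg_domain {
	struct domain_hdr hdr;	/* nothing else is needed */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Allocate a domain and hand back only its header, as list code would. */
static struct domain_hdr *perf_pkg_domain_alloc(int id, int rid)
{
	struct perf_pkg_domain *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->hdr.id = id;
	d->hdr.rid = rid;
	return &d->hdr;
}

/* Map the header back to its containing domain before freeing it. */
static void perf_pkg_domain_free(struct domain_hdr *hdr)
{
	free(container_of(hdr, struct perf_pkg_domain, hdr));
}
```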
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 41 ++++++++++++++++++++++++++++++
fs/resctrl/rdtgroup.c | 4 +++
2 files changed, 45 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0103f577e4ca..97fb2001c8d8 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -543,6 +543,38 @@ static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct
}
}
+/**
+ * struct rdt_perf_pkg_mon_domain - CPUs sharing an Intel-PMT-scoped resctrl monitor resource
+ * @hdr: common header for different domain types
+ */
+struct rdt_perf_pkg_mon_domain {
+ struct rdt_domain_hdr hdr;
+};
+
+static void setup_intel_aet_mon_domain(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos)
+{
+ struct rdt_perf_pkg_mon_domain *d;
+ int err;
+
+ d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
+ if (!d)
+ return;
+
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = r->rid;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_mon_domain(r, &d->hdr);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ kfree(d);
+ }
+}
+
static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
@@ -567,6 +599,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
case RDT_RESOURCE_L3:
setup_l3_mon_domain(cpu, id, r, add_pos);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ setup_intel_aet_mon_domain(cpu, id, r, add_pos);
+ break;
default:
WARN_ON_ONCE(1);
}
@@ -664,6 +699,12 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
synchronize_rcu();
free_l3_mon_domain(hw_dom);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ resctrl_offline_mon_domain(r, hdr);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
+ kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr));
+ break;
}
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index da71057f3ff4..5e0d1777f162 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4167,6 +4167,8 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
mutex_lock(&rdtgroup_mutex);
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ goto do_mkdir;
if (r->rid != RDT_RESOURCE_L3)
goto out_unlock;
@@ -4187,6 +4189,8 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+do_mkdir:
+ err = 0;
/*
* If the filesystem is not mounted then only the default resource group
* exists. Creation of its directories is deferred until mount time
--
2.48.1
* [PATCH v4 24/31] fs/resctrl: Add type define for PERF_PKG files
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (22 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 23/31] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-04-29 0:33 ` [PATCH v4 25/31] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG Tony Luck
` (6 subsequent siblings)
30 siblings, 0 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Creation of the default info file for monitor resources requires
an RFTYPE_RES_ define and for fflags_from_resource() to map from
the RDT_RESOURCE_PERF_PKG resource to the RFTYPE_RES_PERF_PKG value.
Add the define and case in fflags_from_resource().
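For illustration only: a stand-alone sketch of the resource-to-flag mapping
described above. The enum res_id values and the fflags_from_res() name are
hypothetical; only the BIT(11) value for RFTYPE_RES_PERF_PKG is taken from
this patch, and the other bit positions are assumptions for the example.

```c
enum res_id { RES_L3, RES_L2, RES_MBA, RES_SMBA, RES_PERF_PKG, RES_COUNT };

#define BIT(n)			(1UL << (n))
#define RFTYPE_RES_CACHE	BIT(8)	/* assumed value */
#define RFTYPE_RES_MB		BIT(9)	/* assumed value */
#define RFTYPE_RES_PERF_PKG	BIT(11)	/* value added by this patch */

/* Map a resource id to the file-type flag used to select info files. */
static unsigned long fflags_from_res(enum res_id rid)
{
	switch (rid) {
	case RES_L3:
	case RES_L2:
		return RFTYPE_RES_CACHE;
	case RES_MBA:
	case RES_SMBA:
		return RFTYPE_RES_MB;
	case RES_PERF_PKG:
		return RFTYPE_RES_PERF_PKG;
	default:
		return 0;	/* unknown resource: no files match */
	}
}
```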
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 2 ++
fs/resctrl/rdtgroup.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index aaa74a17257d..623a9fadc18a 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -250,6 +250,8 @@ struct rdtgroup {
#define RFTYPE_DEBUG BIT(10)
+#define RFTYPE_RES_PERF_PKG BIT(11)
+
#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 5e0d1777f162..544fa721e067 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2191,6 +2191,8 @@ static unsigned long fflags_from_resource(struct rdt_resource *r)
case RDT_RESOURCE_MBA:
case RDT_RESOURCE_SMBA:
return RFTYPE_RES_MB;
+ case RDT_RESOURCE_PERF_PKG:
+ return RFTYPE_RES_PERF_PKG;
}
return WARN_ON_ONCE(1);
--
2.48.1
* [PATCH v4 25/31] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (23 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 24/31] fs/resctrl: Add type define for PERF_PKG files Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-04-29 0:33 ` [PATCH v4 26/31] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
` (5 subsequent siblings)
30 siblings, 0 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The RDT_RESOURCE_PERF_PKG resource is not marked as "mon_capable" during
early resctrl initialization. This means that the domain lists for the
resource are not built when the CPU hot plug notifiers are registered.
Mark the resource as mon_capable and call domain_add_cpu_mon() for
each online CPU to build the domain lists in the first call to the
resctrl_arch_pre_mount() hook.
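For illustration only: a user-space sketch of the "first call only" pattern
that the pre-mount hook relies on, written with C11 atomics in place of the
kernel's atomic_cmpxchg(). The discover_events() helper and the
domains_built counter are hypothetical stand-ins for event discovery and
domain list construction.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int only_once;
static int domains_built;	/* counts how often the late setup ran */

/* Hypothetical stand-in for late event discovery. */
static bool discover_events(void)
{
	return true;
}

static void pre_mount(void)
{
	int expected = 0;

	/* Only the first caller swaps 0 -> 1; every later caller returns. */
	if (!atomic_compare_exchange_strong(&only_once, &expected, 1))
		return;

	if (!discover_events())
		return;

	/* Stand-in for marking the resource mon_capable and adding CPUs. */
	domains_built++;
}
```

Note that a failed discovery is not retried on a later call, which matches
the ordering of the checks in the hook above.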
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 97fb2001c8d8..9fa4cc66faf4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -761,13 +761,27 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
void resctrl_arch_pre_mount(void)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
static atomic_t only_once;
+ int cpu;
if (atomic_cmpxchg(&only_once, 0, 1))
return;
if (!intel_aet_get_events())
return;
+
+ /*
+ * Late discovery of telemetry events means the domains for the
+ * resource were not built. Do that now.
+ */
+ cpus_read_lock();
+ mutex_lock(&domain_list_lock);
+ r->mon_capable = true;
+ for_each_online_cpu(cpu)
+ domain_add_cpu_mon(cpu, r);
+ mutex_unlock(&domain_list_lock);
+ cpus_read_unlock();
}
enum {
--
2.48.1
* [PATCH v4 26/31] x86/resctrl: Add energy/perf choices to rdt boot option
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (24 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 25/31] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:58 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 27/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
` (4 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Users may want to force either of the telemetry features on
(in the case where they are disabled due to an erratum) or off
(in the case that a limited number of RMIDs for a telemetry
feature reduces the number of monitor groups that can be
created).
Unlike other options that are tied to X86_FEATURE_* flags,
these must be queried by name. Add a function to do that.
Add checks for users who forced either feature off.
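For illustration only: a user-space sketch of looking up a boot option by
name and asking whether it was forced on or off. The parse_opts() helper is
a simplified, hypothetical parser for a string such as "energy,!perf"; the
kernel's own parsing of "rdt=" is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct opt {
	const char *name;
	bool force_on, force_off;
};

static struct opt opts[] = {
	{ "energy", false, false },
	{ "perf", false, false },
};

#define NUM_OPTS (sizeof(opts) / sizeof(opts[0]))

/* Parse a comma-separated list such as "energy,!perf". */
static void parse_opts(const char *arg)
{
	while (*arg) {
		size_t len = strcspn(arg, ",");
		bool off = (*arg == '!');
		const char *tok = arg + off;
		size_t toklen = len - off;

		for (size_t i = 0; i < NUM_OPTS; i++) {
			if (strlen(opts[i].name) == toklen &&
			    !strncmp(tok, opts[i].name, toklen)) {
				opts[i].force_on = !off;
				opts[i].force_off = off;
			}
		}
		arg += len;
		if (*arg == ',')
			arg++;
	}
}

/* Report whether a named option was forced on, or forced off. */
static bool check_option(const char *name, bool is_on, bool is_off)
{
	for (size_t i = 0; i < NUM_OPTS; i++) {
		if (!strcmp(name, opts[i].name))
			return (is_on && opts[i].force_on) ||
			       (is_off && opts[i].force_off);
	}
	return false;
}
```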
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
.../admin-guide/kernel-parameters.txt | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 19 +++++++++++++++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 6 ++++++
4 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d9fd26b95b34..4811bc812f0f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5988,7 +5988,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec.
+ mba, smba, bmec, energy, perf.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index dd5fe8a98304..92cbba9d82a8 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -167,6 +167,8 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
+bool rdt_check_option(char *name, bool is_on, bool is_off);
+
#ifdef CONFIG_INTEL_AET_RESCTRL
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9fa4cc66faf4..dc312e24ab87 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -795,6 +795,8 @@ enum {
RDT_FLAG_MBA,
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
+ RDT_FLAG_ENERGY,
+ RDT_FLAG_PERF,
};
#define RDT_OPT(idx, n, f) \
@@ -820,6 +822,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
+ RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
+ RDT_OPT(RDT_FLAG_PERF, "perf", 0),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
@@ -869,6 +873,21 @@ bool rdt_cpu_has(int flag)
return ret;
}
+/* Check if a named option has been forced on, or forced off */
+bool rdt_check_option(char *name, bool is_on, bool is_off)
+{
+ struct rdt_options *o;
+
+ WARN_ON(!(is_on ^ is_off));
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (!strcmp(name, o->name))
+ return (is_on && o->force_on) || (is_off && o->force_off);
+ }
+
+ return false;
+}
+
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
{
if (!rdt_cpu_has(X86_FEATURE_BMEC))
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 0bbf991da981..aacaedcc7b74 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -49,6 +49,7 @@ struct pmt_event {
* gleaned from the XML files. Others are set from data
* retrieved from intel_pmt_get_regions_by_feature().
* @pfg: The pmt_feature_group for this event group
+ * @name: Name for this group
* @guid: Unique number per XML description file
* @mmio_size: Number of bytes of mmio registers for this group
* @pkginfo: Per-package MMIO addresses
@@ -57,6 +58,7 @@ struct pmt_event {
*/
struct event_group {
struct pmt_feature_group *pfg;
+ char *name;
int guid;
int mmio_size;
struct mmio_info **pkginfo;
@@ -66,6 +68,7 @@ struct event_group {
/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
static struct event_group energy_0x26696143 = {
+ .name = "energy",
.guid = 0x26696143,
.mmio_size = (576 * 2 + 3) * 8,
.num_events = 2,
@@ -77,6 +80,7 @@ static struct event_group energy_0x26696143 = {
/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
static struct event_group perf_0x26557651 = {
+ .name = "perf",
.guid = 0x26557651,
.mmio_size = (576 * 7 + 3) * 8,
.num_events = 7,
@@ -208,6 +212,8 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
for (int i = 0; i < p->count; i++) {
if ((*peg)->guid == p->regions[i].guid) {
+ if (rdt_check_option((*peg)->name, false, true))
+ return false;
ret = configure_events((*peg), p);
if (ret) {
(*peg)->pfg = no_free_ptr(p);
--
2.48.1
* [PATCH v4 27/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (25 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 26/31] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-05-08 15:59 ` Reinette Chatre
2025-04-29 0:33 ` [PATCH v4 28/31] x86,fs/resctrl: Fix RMID allocation for multiple monitor resources Tony Luck
` (3 subsequent siblings)
30 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There are now three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF. This
is the maximum number of distinct values that can be loaded into the
IA32_PQR_ASSOC MSR. Note that systems with Sub-NUMA Cluster mode enabled
will force scaling down the CPUID enumerated value by the number of SNC
nodes per L3-cache.
2) The number of registers in MMIO space for each event. This
is enumerated in the XML files and is the value placed into
event_group::num_rmids.
3) The number of "h/w counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that
does describe the limitations) feeding to those MMIO registers. This
is enumerated in telemetry_region::num_rmids returned from the call to
intel_pmt_get_regions_by_feature().
Event groups with insufficient "h/w counters" to track all RMIDs are
difficult for users to use, since the system may reassign "h/w counters"
at any time. This means that users cannot reliably collect two consecutive
event counts to compute the rate at which events are occurring.
Ignore such under-resourced event groups unless the user explicitly
requests to enable them using the "rdt=" Linux boot argument.
Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG
resource "num_rmids" value to the smallest of these values to ensure
that all resctrl groups have equal monitor capabilities.
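For illustration only: a user-space sketch of the policy described above.
A group is enabled if it can track every system RMID or if the user forced
it on, and the resource-wide RMID count is the smallest count among enabled
groups. The struct group layout and pick_num_rmids() are hypothetical names
for this example.

```c
#include <stdbool.h>
#include <stddef.h>

struct group {
	int hw_rmids;		/* RMIDs the hardware can actually track */
	bool forced_on;		/* user asked for the group via rdt= */
	bool enabled;		/* result of the policy below */
};

/*
 * Enable groups that can track every system RMID (or that the user
 * forced on) and return the smallest RMID count among enabled groups,
 * or 0 if no group is enabled.
 */
static int pick_num_rmids(struct group *g, size_t n, int system_rmids)
{
	int num = 0;

	for (size_t i = 0; i < n; i++) {
		g[i].enabled = g[i].hw_rmids >= system_rmids || g[i].forced_on;
		if (!g[i].enabled)
			continue;
		if (!num || g[i].hw_rmids < num)
			num = g[i].hw_rmids;
	}
	return num;
}
```

With a system count of 256, a group tracking 576 RMIDs is enabled and one
tracking 64 is ignored; forcing the second group on lowers the result to 64.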
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 25 +++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 2 ++
3 files changed, 29 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 92cbba9d82a8..31499bcd2065 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -18,6 +18,8 @@
#define RMID_VAL_UNAVAIL BIT_ULL(62)
+extern int rdt_num_system_rmids;
+
/*
* With the above fields in use 62 bits remain in MSR_IA32_QM_CTR for
* data to be returned. The counter width is discovered from the hardware
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index aacaedcc7b74..eec5eb625f13 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -14,6 +14,7 @@
#include <linux/cleanup.h>
#include <linux/cpu.h>
#include <linux/io.h>
+#include <linux/minmax.h>
#include <linux/resctrl.h>
/* Temporary - delete from final version */
@@ -51,6 +52,7 @@ struct pmt_event {
* @pfg: The pmt_feature_group for this event group
* @name: Name for this group
* @guid: Unique number per XML description file
+ * @num_rmids: Number of RMIDS supported by this group
* @mmio_size: Number of bytes of mmio registers for this group
* @pkginfo: Per-package MMIO addresses
* @num_events: Number of events in this group
@@ -60,6 +62,7 @@ struct event_group {
struct pmt_feature_group *pfg;
char *name;
int guid;
+ int num_rmids;
int mmio_size;
struct mmio_info **pkginfo;
int num_events;
@@ -70,6 +73,7 @@ struct event_group {
static struct event_group energy_0x26696143 = {
.name = "energy",
.guid = 0x26696143,
+ .num_rmids = 576,
.mmio_size = (576 * 2 + 3) * 8,
.num_events = 2,
.evts = {
@@ -82,6 +86,7 @@ static struct event_group energy_0x26696143 = {
static struct event_group perf_0x26557651 = {
.name = "perf",
.guid = 0x26557651,
+ .num_rmids = 576,
.mmio_size = (576 * 7 + 3) * 8,
.num_events = 7,
.evts = {
@@ -214,6 +219,15 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
if ((*peg)->guid == p->regions[i].guid) {
if (rdt_check_option((*peg)->name, false, true))
return false;
+ /*
+ * Ignore event group with insufficient RMIDs unless the
+ * user used the rdt= boot option to specifically ask
+ * for it to be enabled.
+ */
+ if (p->regions[i].num_rmids < rdt_num_system_rmids &&
+ !rdt_check_option((*peg)->name, true, false))
+ return false;
+ (*peg)->num_rmids = p->regions[i].num_rmids;
ret = configure_events((*peg), p);
if (ret) {
(*peg)->pfg = no_free_ptr(p);
@@ -233,11 +247,22 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
*/
bool intel_aet_get_events(void)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
+ struct event_group **eg;
bool ret1, ret2;
ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM);
ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM);
+ for (eg = &known_event_groups[0]; eg < &known_event_groups[NUM_KNOWN_GROUPS]; eg++) {
+ if (!(*eg)->pfg)
+ continue;
+ if (r->num_rmid)
+ r->num_rmid = min(r->num_rmid, (*eg)->num_rmids);
+ else
+ r->num_rmid = (*eg)->num_rmids;
+ }
+
return ret1 || ret2;
}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 04214585824b..7e3a68058b90 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -32,6 +32,7 @@ bool rdt_mon_capable;
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
+int rdt_num_system_rmids;
static int snc_nodes_per_l3_cache = 1;
/*
@@ -354,6 +355,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
+ rdt_num_system_rmids = r->num_rmid;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
--
2.48.1
* [PATCH v4 28/31] x86,fs/resctrl: Fix RMID allocation for multiple monitor resources
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (26 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 27/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-04-29 0:33 ` [PATCH v4 29/31] fs/resctrl: Add interface for per-resource debug info files Tony Luck
` (2 subsequent siblings)
30 siblings, 0 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The resctrl file system code assumed that the only monitor events were
tied to the RDT_RESOURCE_L3 resource, and that the number of supported
RMIDs was enumerated during early initialization.
RDT_RESOURCE_PERF_PKG breaks both of those assumptions.
Delay the final enumeration of the number of RMIDs and subsequent
allocation of structures until first mount of the resctrl file system.
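For illustration only: a user-space sketch of computing the system-wide
RMID limit as the minimum over all monitor-capable resources, with a
sentinel value signalling that no such resource exists. The struct res
layout and system_num_rmid() are hypothetical; INT_MAX plays the role that
S32_MAX plays in the patch.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct res {
	bool mon_capable;
	int num_rmid;
};

/* Returns INT_MAX when no resource is monitor capable. */
static int system_num_rmid(const struct res *r, size_t n)
{
	int num = INT_MAX;

	for (size_t i = 0; i < n; i++) {
		if (r[i].mon_capable && r[i].num_rmid < num)
			num = r[i].num_rmid;
	}
	return num;
}
```

A caller can compare the result against the sentinel to skip allocating
per-RMID structures when nothing can be monitored.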
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 4 +--
arch/x86/kernel/cpu/resctrl/core.c | 8 +++--
fs/resctrl/monitor.c | 48 +++++++++++++-----------------
fs/resctrl/rdtgroup.c | 13 ++++----
4 files changed, 35 insertions(+), 38 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 623a9fadc18a..fb5ae8ba0c17 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -364,7 +364,7 @@ int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
-void resctrl_mon_resource_exit(void);
+void resctrl_dom_data_exit(void);
void mon_event_count(void *info);
@@ -405,7 +405,7 @@ enum resctrl_event_id resctrl_get_mon_event_by_name(char *name);
char *resctrl_mon_event_name(enum resctrl_event_id evt);
-void resctrl_init_mon_events(void);
+int resctrl_init_mon_events(void);
#ifdef CONFIG_RESCTRL_FS_PSEUDO_LOCK
int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index dc312e24ab87..d921f32a1b6c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -112,10 +112,14 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
u32 resctrl_arch_system_num_rmid_idx(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r;
+ int num_rmids = S32_MAX;
+
+ for_each_mon_capable_rdt_resource(r)
+ num_rmids = min(num_rmids, r->num_rmid);
/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
- return r->num_rmid;
+ return num_rmids;
}
struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index f848325591b4..f7a5ffe9be25 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -762,7 +762,7 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
-static int dom_data_init(struct rdt_resource *r)
+static int resctrl_dom_data_init(struct rdt_resource *r)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
u32 num_closid = resctrl_arch_get_num_closid(r);
@@ -770,7 +770,10 @@ static int dom_data_init(struct rdt_resource *r)
int err = 0, i;
u32 idx;
- mutex_lock(&rdtgroup_mutex);
+ /* Are there any mon_capable resources? */
+ if (idx_limit == S32_MAX)
+ return 0;
+
if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
u32 *tmp;
@@ -783,7 +786,7 @@ static int dom_data_init(struct rdt_resource *r)
tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL);
if (!tmp) {
err = -ENOMEM;
- goto out_unlock;
+ goto out;
}
closid_num_dirty_rmid = tmp;
@@ -796,7 +799,7 @@ static int dom_data_init(struct rdt_resource *r)
closid_num_dirty_rmid = NULL;
}
err = -ENOMEM;
- goto out_unlock;
+ goto out;
}
for (i = 0; i < idx_limit; i++) {
@@ -817,14 +820,15 @@ static int dom_data_init(struct rdt_resource *r)
entry = __rmid_entry(idx);
list_del(&entry->list);
-out_unlock:
- mutex_unlock(&rdtgroup_mutex);
+out:
return err;
}
-static void dom_data_exit(struct rdt_resource *r)
+void resctrl_dom_data_exit(void)
{
+ struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+
mutex_lock(&rdtgroup_mutex);
if (!r->mon_capable)
@@ -954,16 +958,21 @@ char *resctrl_mon_event_name(enum resctrl_event_id evt)
* events have been enumerated. Only needs to build the per-resource
* event lists once.
*/
-void resctrl_init_mon_events(void)
+int resctrl_init_mon_events(void)
{
+ struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
enum resctrl_event_id evt;
- struct rdt_resource *r;
static bool only_once;
+ int ret;
if (only_once)
- return;
+ return 0;
only_once = true;
+ ret = resctrl_dom_data_init(r);
+ if (ret)
+ return ret;
+
for_each_mon_capable_rdt_resource(r)
INIT_LIST_HEAD(&r->evt_list);
@@ -973,6 +982,8 @@ void resctrl_init_mon_events(void)
r = resctrl_arch_get_resource(mon_event_all[evt].rid);
list_add_tail(&mon_event_all[evt].list, &r->evt_list);
}
+
+ return ret;
}
/**
@@ -989,16 +1000,6 @@ void resctrl_init_mon_events(void)
*/
int resctrl_mon_resource_init(void)
{
- struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- int ret;
-
- if (!r->mon_capable)
- return 0;
-
- ret = dom_data_init(r);
- if (ret)
- return ret;
-
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
@@ -1017,10 +1018,3 @@ int resctrl_mon_resource_init(void)
return 0;
}
-
-void resctrl_mon_resource_exit(void)
-{
- struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
-
- dom_data_exit(r);
-}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 544fa721e067..195e41eb73fb 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2595,7 +2595,9 @@ static int rdt_get_tree(struct fs_context *fc)
goto out;
}
- resctrl_init_mon_events();
+ ret = resctrl_init_mon_events();
+ if (ret)
+ goto out;
ret = rdtgroup_setup_root(ctx);
if (ret)
@@ -4300,10 +4302,8 @@ int resctrl_init(void)
return ret;
ret = sysfs_create_mount_point(fs_kobj, "resctrl");
- if (ret) {
- resctrl_mon_resource_exit();
+ if (ret)
return ret;
- }
ret = register_filesystem(&rdt_fs_type);
if (ret)
@@ -4336,7 +4336,6 @@ int resctrl_init(void)
cleanup_mountpoint:
sysfs_remove_mount_point(fs_kobj, "resctrl");
- resctrl_mon_resource_exit();
return ret;
}
@@ -4363,7 +4362,7 @@ static bool resctrl_online_domains_exist(void)
* When called by the architecture code, all CPUs and resctrl domains must be
* offline. This ensures the limbo and overflow handlers are not scheduled to
* run, meaning the data structures they access can be freed by
- * resctrl_mon_resource_exit().
+ * resctrl_dom_data_exit().
*
* After this function has returned, the architecture code should return an
* from all resctrl_arch_ functions that can do this.
@@ -4390,5 +4389,5 @@ void resctrl_exit(void)
* it can be used to umount resctrl.
*/
- resctrl_mon_resource_exit();
+ resctrl_dom_data_exit();
}
--
2.48.1
* [PATCH v4 29/31] fs/resctrl: Add interface for per-resource debug info files
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (27 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 28/31] x86,fs/resctrl: Fix RMID allocation for multiple monitor resources Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-04-29 0:33 ` [PATCH v4 30/31] x86/resctrl: Add info/PERF_PKG_MON/status file Tony Luck
2025-04-29 0:33 ` [PATCH v4 31/31] x86/resctrl: Update Documentation for package events Tony Luck
30 siblings, 0 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There are some status registers on each of the telemetry aggregators.
Users may need to view these to understand unexpected event counter
values.
Add the file system support for a "status" file in each mon_capable
resource directory in the "info" directory. This will only be present if
the file system is mounted with the "-o debug" option. It will only have
content for resources that provide an rdt_resource::info_debug() routine.
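For illustration only: a user-space sketch of an optional per-resource
callback, where the "status" file shows nothing for resources that do not
supply one. The callback here writes into a caller-provided buffer rather
than a seq_file, and all names and the sample output line are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

struct resource {
	const char *name;
	/* Optional: resources without debug status leave this NULL. */
	int (*info_debug)(char *buf, size_t len);
};

/* Example callback producing one line of made-up status text. */
static int perf_pkg_debug(char *buf, size_t len)
{
	return snprintf(buf, len, "agg_data_loss_count 0\n");
}

/* Fill buf with the resource's status text; empty if it has none. */
static int status_show(const struct resource *r, char *buf, size_t len)
{
	if (len)
		buf[0] = '\0';
	if (r->info_debug)
		return r->info_debug(buf, len);
	return 0;
}
```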
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 1 +
fs/resctrl/rdtgroup.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 3ae50b947a99..675cfbe3e6c6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -299,6 +299,7 @@ struct rdt_resource {
struct list_head evt_list;
unsigned int mbm_cfg_mask;
bool cdp_capable;
+ void (*info_debug)(struct seq_file *s);
};
/*
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 195e41eb73fb..5948e279b44c 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1022,6 +1022,17 @@ static int rdt_shareable_bits_show(struct kernfs_open_file *of,
return 0;
}
+static int rdtgroup_info_debug_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
+
+ if (r->info_debug)
+ r->info_debug(s);
+
+ return 0;
+}
+
/*
* rdt_bit_usage_show - Display current usage of resources
*
@@ -1983,6 +1994,13 @@ static struct rftype res_common_files[] = {
.seq_show = rdtgroup_closid_show,
.fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG,
},
+ {
+ .name = "status",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_info_debug_show,
+ .fflags = RFTYPE_MON_INFO | RFTYPE_DEBUG,
+ },
};
static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
--
2.48.1
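
The optional per-resource hook described in this patch can be sketched in plain userspace C. This is only an illustration, assuming simplified stand-in types: "struct demo_resource", "demo_status_show()" and "demo_pkg_debug()" are hypothetical names, not kernel symbols, and a character buffer stands in for the seq_file.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rdt_resource; only the hook matters here. */
struct demo_resource {
	const char *name;
	/* Optional hook: writes debug text into buf, NULL if the resource has none. */
	int (*info_debug)(char *buf, size_t len);
};

/* Example hook for a resource that has status information to report. */
static int demo_pkg_debug(char *buf, size_t len)
{
	return snprintf(buf, len, "agg_data_loss_count = 0\n");
}

/*
 * Same shape as rdtgroup_info_debug_show(): the file always exists, but it
 * only has content when the resource supplies a hook.
 */
static int demo_status_show(const struct demo_resource *r, char *buf, size_t len)
{
	if (len)
		buf[0] = '\0';
	if (r->info_debug)
		return r->info_debug(buf, len);
	return 0;
}
```

A resource declared without a hook yields an empty "status" file; one that sets the hook gets whatever text the hook emits.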
* [PATCH v4 30/31] x86/resctrl: Add info/PERF_PKG_MON/status file
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (28 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 29/31] fs/resctrl: Add interface for per-resource debug info files Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
2025-04-29 0:33 ` [PATCH v4 31/31] x86/resctrl: Update Documentation for package events Tony Luck
30 siblings, 0 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each telemetry aggregator provides three status registers at the top
end of MMIO space after all the per-RMID per-event counters:
agg_data_loss_count: This counts the number of times that this aggregator
failed to accumulate a counter value supplied by a CPU core.
agg_data_loss_timestamp: This is a "timestamp" from a free running
25MHz uncore timer indicating when the most recent data loss occurred.
last_update_timestamp: Another 25MHz timestamp indicating when the
most recent counter update was successfully applied.
When the resctrl file system is mounted with the "-o debug" option
display the values of each of these status registers for each aggregator
in each enabled event group.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 30 +++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index eec5eb625f13..80a8af3c4711 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -16,6 +16,7 @@
#include <linux/io.h>
#include <linux/minmax.h>
#include <linux/resctrl.h>
+#include <linux/seq_file.h>
/* Temporary - delete from final version */
#include "fake_intel_aet_features.h"
@@ -241,6 +242,33 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
return false;
}
+static void show_debug(struct seq_file *s, struct event_group *e, int pkg, int instance)
+{
+ u64 __iomem *info = e->pkginfo[pkg]->addrs[instance] + e->mmio_size;
+
+ /* Information registers are the last three qwords in MMIO space */
+ seq_printf(s, "%s %d:%d agg_data_loss_count = %llu\n", e->name, pkg, instance,
+ readq(&info[-3]));
+ seq_printf(s, "%s %d:%d agg_data_loss_timestamp = %llu\n", e->name, pkg, instance,
+ readq(&info[-2]));
+ seq_printf(s, "%s %d:%d last_update_timestamp = %llu\n", e->name, pkg, instance,
+ readq(&info[-1]));
+}
+
+static void info_debug(struct seq_file *s)
+{
+ int num_pkgs = topology_max_packages();
+ struct event_group **eg;
+
+ for (eg = &known_event_groups[0]; eg < &known_event_groups[NUM_KNOWN_GROUPS]; eg++) {
+ if (!(*eg)->pfg)
+ continue;
+ for (int i = 0; i < num_pkgs; i++)
+ for (int j = 0; j < (*eg)->pkginfo[i]->count; j++)
+ show_debug(s, *eg, i, j);
+ }
+}
+
/*
* Ask OOBMSM discovery driver for all the RMID based telemetry groups
* that it supports.
@@ -263,6 +291,8 @@ bool intel_aet_get_events(void)
r->num_rmid = (*eg)->num_rmids;
}
+ r->info_debug = info_debug;
+
return ret1 || ret2;
}
--
2.48.1
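
The "last three qwords" layout can be illustrated with an ordinary array standing in for one aggregator's MMIO region. This is a sketch under stated assumptions: the "demo_" names are hypothetical, plain loads replace readq(), and the conversion of a 25MHz tick count to milliseconds assumes the timestamp is a simple tick count since the timer started, which the patch does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_STATUS_QWORDS	3
#define DEMO_TIMER_HZ		25000000ULL	/* 25MHz uncore timer */

struct demo_status {
	uint64_t agg_data_loss_count;
	uint64_t agg_data_loss_timestamp;
	uint64_t last_update_timestamp;
};

/* region: start of one (fake) MMIO space; size_bytes: its total size. */
static struct demo_status demo_read_status(const uint64_t *region, size_t size_bytes)
{
	const uint64_t *end = region + size_bytes / sizeof(uint64_t);
	struct demo_status st;

	assert(size_bytes >= DEMO_STATUS_QWORDS * sizeof(uint64_t));

	/* Status registers sit after all the per-RMID per-event counters. */
	st.agg_data_loss_count = end[-3];
	st.agg_data_loss_timestamp = end[-2];
	st.last_update_timestamp = end[-1];
	return st;
}

/* Convert a tick count of the 25MHz timer to milliseconds (assumption). */
static uint64_t demo_ticks_to_ms(uint64_t ticks)
{
	return ticks / (DEMO_TIMER_HZ / 1000);
}
```

Indexing backwards from the end of the region means the same code works regardless of how many event counters precede the status registers.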
* [PATCH v4 31/31] x86/resctrl: Update Documentation for package events
2025-04-29 0:33 [PATCH v4 00/31] x86/resctrl telemetry monitoring Tony Luck
` (29 preceding siblings ...)
2025-04-29 0:33 ` [PATCH v4 30/31] x86/resctrl: Add info/PERF_PKG_MON/status file Tony Luck
@ 2025-04-29 0:33 ` Tony Luck
30 siblings, 0 replies; 72+ messages in thread
From: Tony Luck @ 2025-04-29 0:33 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each "mon_data" directory is now divided between L3 events and package
events.
The "info/PERF_PKG_MON" directory contains parameters for perf events.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Documentation/filesystems/resctrl.rst | 53 ++++++++++++++++++++++-----
1 file changed, 43 insertions(+), 10 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..a452fd54b3ae 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -167,7 +167,7 @@ with respect to allocation:
bandwidth percentages are directly applied to
the threads running on the core
-If RDT monitoring is available there will be an "L3_MON" directory
+If L3 monitoring is available there will be an "L3_MON" directory
with the following files:
"num_rmids":
@@ -261,6 +261,23 @@ with the following files:
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
+If telemetry monitoring is available there will be a "PERF_PKG_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of telemetry RMIDs supported. If this is different
+ from the number reported in the L3_MON directory the limit
+ on the number of "CTRL_MON" + "MON" directories is the
+ minimum of the values.
+
+"mon_features":
+ Lists the telemetry monitoring events that are enabled on this system.
+
+When the filesystem is mounted with the debug option each subdirectory
+for a monitor resource of the "info" directory will contain a "status"
+file. Resources may use this to supply debug information about the status
+of the hardware implementing the resource.
+
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
@@ -366,15 +383,31 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:
"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
- "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
- files provide a read out of the current value of the event for
- all tasks in the group. In CTRL_MON groups these files provide
- the sum for all tasks in the CTRL_MON group and all tasks in
- MON groups. Please see example section for more details on usage.
+ This contains a set of directories, one for each instance
+ of an L3 cache, or of a processor package. The L3 cache
+ directories are named "mon_L3_00", "mon_L3_01" etc. The
+ package directories "mon_PERF_PKG_00", "mon_PERF_PKG_01" etc.
+
+ Within each directory there is one file per event. In
+ the L3 directories: "llc_occupancy", "mbm_total_bytes",
+ and "mbm_local_bytes". In the PERF_PKG directories: "core_energy",
+ "activity", etc.
+
+ "core_energy" reports a floating point number for the energy
+ (in Joules) used by cores for each RMID.
+
+ "activity" also reports a floating point value (in Farads).
+ This provides an estimate of work done independent of the
+ frequency that the cores used for execution.
+
+ All other events report decimal integer values.
+
+ In a MON group these files provide a read out of the current
+ value of the event for all tasks in the group. In CTRL_MON groups
+ these files provide the sum for all tasks in the CTRL_MON group
+ and all tasks in MON groups. Please see example section for more
+ details on usage.
+
On systems with Sub-NUMA Cluster (SNC) enabled there are extra
directories for each node (located within the "mon_L3_XX" directory
for the L3 cache they occupy). These are named "mon_sub_L3_YY"
--
2.48.1
* Re: [PATCH v4 06/31] x86/rectrl: Fake OOBMSM interface
2025-04-29 0:33 ` [PATCH v4 06/31] x86/rectrl: Fake OOBMSM interface Tony Luck
@ 2025-04-30 23:02 ` Luck, Tony
2025-05-08 3:33 ` Reinette Chatre
1 sibling, 0 replies; 72+ messages in thread
From: Luck, Tony @ 2025-04-30 23:02 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
On Mon, Apr 28, 2025 at 05:33:32PM -0700, Tony Luck wrote:
> Real version is coming soon ... this is here so the remaining parts
> will build (and run ... assuming a 2 socket system that supports RDT
> monitoring ... only missing part is that the event counters just
> report fixed values).
Real OOBMSM discovery patches have now been posted:
https://lore.kernel.org/all/20250430212106.369208-1-david.e.box@linux.intel.com/
-Tony
* Re: [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable
2025-04-29 0:33 ` [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable Tony Luck
@ 2025-05-08 3:28 ` Reinette Chatre
2025-05-08 18:32 ` Luck, Tony
0 siblings, 1 reply; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:28 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> The fs/arch boundary is a little muddy for adding new monitor features.
It is not possible to accurately interpret what is meant by "little muddy".
Please add specific information that can be verified/reasoned about.
>
> Clean it up by making the mon_evt structure the source of all information
> about each event. In this case replace the bitmap of enabled monitor
> features with an "enabled" bit in the mon_evt structure.
bit -> boolean?
>
> Change architecture code to inform file system code which events are
> available on a system with resctrl_enable_mon_event().
(nit: no need to mention that a patch changes code, it should be implied.)
This could be, "An architecture uses resctrl_enable_mon_event() to inform
resctrl fs which events are enabled on the system."
(I think we need to be cautious about the "available" vs "enabled"
distinction.)
>
> Replace the event and architecture specific:
> resctrl_arch_is_llc_occupancy_enabled()
> resctrl_arch_is_mbm_total_enabled()
> resctrl_arch_is_mbm_local_enabled()
> functions with calls to resctrl_is_mon_event_enabled() with the
> appropriate QOS_L3_* enum resctrl_event_id.
No mention or motivation for the new array. I think the new array is an
improvement and now it begs the question whether rdt_resource::evt_list is
still needed? It seems to me that any usage of rdt_resource::evt_list can
use the new mon_event_all[] instead?
With struct mon_evt being independent like before this
patch it almost seems as though it prepared for multiple resources to
support the same event (do you know history here?). This appears to already
be thwarted by rdt_mon_features though ... although theoretically it could
have been "rdt_l3_mon_features".
Even so, with patch #4 adding the resource ID all event information is
centralized. Only potential issue may be if multiple resources use the
same event ... but since the existing event IDs already have resource
name embedded this does not seem to be of concern?
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
...
> @@ -866,14 +879,13 @@ static struct mon_evt mbm_local_event = {
> */
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> + enum resctrl_event_id evt;
> +
> INIT_LIST_HEAD(&r->evt_list);
>
> - if (resctrl_arch_is_llc_occupancy_enabled())
> - list_add_tail(&llc_occupancy_event.list, &r->evt_list);
> - if (resctrl_arch_is_mbm_total_enabled())
> - list_add_tail(&mbm_total_event.list, &r->evt_list);
> - if (resctrl_arch_is_mbm_local_enabled())
> - list_add_tail(&mbm_local_event.list, &r->evt_list);
> + for (evt = 0; evt < QOS_NUM_EVENTS; evt++)
> + if (mon_event_all[evt].enabled)
> + list_add_tail(&mon_event_all[evt].list, &r->evt_list);
> }
This hunk can create confusion with it adding "all enabled events" to
a single resource. I understand that at this point only L3 supports monitoring
and this works ok, but in the context of this work it creates a caveat early
in the series that needs to be fixed later (patch #4). This wrangling becomes
unnecessary if removing rdt_resource::evt_list.
Reinette
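
The suggestion to rely on a single event array instead of a per-resource list can be sketched as follows. This is a userspace illustration only: the "demo_" types and event IDs are hypothetical and do not match the kernel's enum values, and the resource ID field is the one that patch #4 is said to add.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_res_id { DEMO_RES_L3, DEMO_RES_PERF_PKG };

enum demo_event_id {
	DEMO_L3_OCCUP,
	DEMO_L3_MBM_TOTAL,
	DEMO_L3_MBM_LOCAL,
	DEMO_PKG_CORE_ENERGY,
	DEMO_NUM_EVENTS,
};

/* One entry per event; the array is the only source of event information. */
struct demo_mon_evt {
	const char *name;
	enum demo_res_id rid;
	bool enabled;
};

static struct demo_mon_evt demo_event_all[DEMO_NUM_EVENTS] = {
	[DEMO_L3_OCCUP]		= { "llc_occupancy",   DEMO_RES_L3,       false },
	[DEMO_L3_MBM_TOTAL]	= { "mbm_total_bytes", DEMO_RES_L3,       false },
	[DEMO_L3_MBM_LOCAL]	= { "mbm_local_bytes", DEMO_RES_L3,       false },
	[DEMO_PKG_CORE_ENERGY]	= { "core_energy",     DEMO_RES_PERF_PKG, false },
};

/* Architecture code tells the file system which events are enabled. */
static void demo_enable_mon_event(enum demo_event_id evt)
{
	assert(evt < DEMO_NUM_EVENTS);
	demo_event_all[evt].enabled = true;
}

static bool demo_is_mon_event_enabled(enum demo_event_id evt)
{
	return evt < DEMO_NUM_EVENTS && demo_event_all[evt].enabled;
}

/* Walk the array filtered by resource; no per-resource list is needed. */
static int demo_count_enabled(enum demo_res_id rid)
{
	int n = 0;

	for (int evt = 0; evt < DEMO_NUM_EVENTS; evt++)
		if (demo_event_all[evt].enabled && demo_event_all[evt].rid == rid)
			n++;
	return n;
}
```

Any code that currently walks a resource's event list could, in this model, loop over the array and skip entries whose resource ID does not match.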
* Re: [PATCH v4 02/31] x86,fs/resctrl: Prepare for more monitor events
2025-04-29 0:33 ` [PATCH v4 02/31] x86,fs/resctrl: Prepare for more monitor events Tony Luck
@ 2025-05-08 3:30 ` Reinette Chatre
2025-05-09 15:02 ` Peter Newman
1 sibling, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:30 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> There's a rule in computer programming that objects appear zero,
> once, or many times. So code accordingly.
>
> There are two MBM events and resctrl is coded with a lot of
>
> if (local)
> do one thing
> if (total)
> do a different thing
>
> Change the rdt_ctrl_domain and rdt_hw_mon_domain structures to hold
rdt_ctrl_domain -> rdt_mon_domain
> arrays of pointers to per event data instead of explicit fields for
> total and local bandwidth.
>
> Simplify the code by coding for many events using loops on
"Simplify the code by coding" seems redundant, maybe just "Simplify"?
> which are enabled.
>
> Move resctrl_is_mbm_event() to <linux/resctrl.h> so it
> can be used more widely. Also provide a for_each_mbm_event()
> helper macro.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 15 +++++---
> include/linux/resctrl_types.h | 3 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 6 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 38 ++++++++++----------
> arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++++++--------
> fs/resctrl/monitor.c | 13 ++++---
> fs/resctrl/rdtgroup.c | 48 ++++++++++++--------------
> 7 files changed, 84 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 3c5d111aae65..cef9b0ed984c 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -161,8 +161,7 @@ struct rdt_ctrl_domain {
> * @hdr: common header for different domain types
> * @ci: cache info for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> - * @mbm_total: saved state for MBM total bandwidth
> - * @mbm_local: saved state for MBM local bandwidth
> + * @mbm_states: saved state for each QOS MBM event
> * @mbm_over: worker to periodically read MBM h/w counters
> * @cqm_limbo: worker to periodically read CQM h/w counters
> * @mbm_work_cpu: worker CPU for MBM h/w counters
> @@ -172,8 +171,7 @@ struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> struct cacheinfo *ci;
> unsigned long *rmid_busy_llc;
> - struct mbm_state *mbm_total;
> - struct mbm_state *mbm_local;
> + struct mbm_state *mbm_states[QOS_NUM_MBM_EVENTS];
> struct delayed_work mbm_over;
> struct delayed_work cqm_limbo;
> int mbm_work_cpu;
> @@ -376,6 +374,15 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid);
> bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
>
> +static inline bool resctrl_is_mbm_event(int e)
This looks like a good time to change the parameter type to enum resctrl_event_id.
> +{
> + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> + e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +#define for_each_mbm_event(evt) \
> + for (evt = QOS_L3_MBM_TOTAL_EVENT_ID; evt <= QOS_L3_MBM_LOCAL_EVENT_ID; evt++)
> +
> /**
> * resctrl_arch_mon_event_config_write() - Write the config for an event.
> * @config_info: struct resctrl_mon_config_info describing the resource, domain
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index a25fb9c4070d..5ef14a24008c 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -47,4 +47,7 @@ enum resctrl_event_id {
> QOS_NUM_EVENTS,
> };
>
> +#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
To prevent possible future confusion/churn while making existing code easier to read,
could this be "QOS_NUM_L3_MBM_EVENTS"?
> +#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
Naming nit: I think "MBM event idx" is close enough to "MBM event id" to cause confusion.
How about "MBM_STATE_IDX"?
> +
> #endif /* __LINUX_RESCTRL_TYPES_H */
...
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index fda579251dba..bf7fde07846b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,18 +160,21 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
> u32 rmid,
> enum resctrl_event_id eventid)
> {
> + struct arch_mbm_state *state;
> +
> switch (eventid) {
> - case QOS_L3_OCCUP_EVENT_ID:
> - return NULL;
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &hw_dom->arch_mbm_total[rmid];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &hw_dom->arch_mbm_local[rmid];
> default:
> /* Never expect to get here */
> WARN_ON_ONCE(1);
> + fallthrough;
> + case QOS_L3_OCCUP_EVENT_ID:
> return NULL;
> + case QOS_L3_MBM_TOTAL_EVENT_ID:
> + case QOS_L3_MBM_LOCAL_EVENT_ID:
> + state = hw_dom->arch_mbm_states[MBM_EVENT_IDX(eventid)];
Please add a "break" here. Although, I find the resulting get_mbm_state() to accomplish
the same in a much simpler way. Why did get_arch_mbm_state() and get_mbm_state() end up
looking so different?
> }
> +
> + return state ? &state[rmid] : NULL;
> }
>
> void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
...
> @@ -346,15 +346,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> u32 rmid, enum resctrl_event_id evtid)
> {
> u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> + struct mbm_state *states;
>
> - switch (evtid) {
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &d->mbm_total[idx];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &d->mbm_local[idx];
> - default:
> + if (!resctrl_is_mbm_event(evtid))
> return NULL;
> - }
> +
> + states = d->mbm_states[MBM_EVENT_IDX(evtid)];
> +
> + return states ? &states[idx] : NULL;
> }
>
get_mbm_state() above for comparison.
Reinette
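
For comparison, the array-of-states lookup with the naming suggested above can be sketched in userspace C. The "demo_" identifiers are hypothetical; "DEMO_NUM_L3_MBM_EVENTS" and "DEMO_MBM_STATE_IDX" only mirror the proposed renames, and the enum values are chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_event_id {
	DEMO_L3_OCCUP = 1,
	DEMO_L3_MBM_TOTAL = 2,
	DEMO_L3_MBM_LOCAL = 3,
};

#define DEMO_NUM_L3_MBM_EVENTS	(DEMO_L3_MBM_LOCAL - DEMO_L3_MBM_TOTAL + 1)
#define DEMO_MBM_STATE_IDX(evt)	((evt) - DEMO_L3_MBM_TOTAL)

struct demo_mbm_state {
	unsigned long long prev_bytes;
};

struct demo_mon_domain {
	/* One per-RMID state array per MBM event; NULL if the event is not enabled. */
	struct demo_mbm_state *mbm_states[DEMO_NUM_L3_MBM_EVENTS];
};

static bool demo_is_mbm_event(enum demo_event_id evt)
{
	return evt >= DEMO_L3_MBM_TOTAL && evt <= DEMO_L3_MBM_LOCAL;
}

/* Single code path for every MBM event, no switch on the event ID. */
static struct demo_mbm_state *demo_get_mbm_state(struct demo_mon_domain *d,
						 unsigned int idx,
						 enum demo_event_id evt)
{
	struct demo_mbm_state *states;

	if (!demo_is_mbm_event(evt))
		return NULL;

	states = d->mbm_states[DEMO_MBM_STATE_IDX(evt)];

	return states ? &states[idx] : NULL;
}
```

Taking the event ID as an enum rather than an int, as requested in the review, lets the compiler catch callers that pass an unrelated integer.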
* Re: [PATCH v4 03/31] fs/resctrl: Clean up rdtgroup_mba_mbps_event_{show,write}()
2025-04-29 0:33 ` [PATCH v4 03/31] fs/resctrl: Clean up rdtgroup_mba_mbps_event_{show,write}() Tony Luck
@ 2025-05-08 3:31 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:31 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> These routines hard-code the two legacy mbm events.
No context, just a cryptic problem description.
>
> Change to allow for other mbm events in the future.
What is changed? How are other MBM events allowed in future?
Something to start with (incomplete and needs improvement still):
rdtgroup_mba_mbps_event_{show,write}() respectively shows and stores
which memory bandwidth event is used as input to the software feedback
loop that keeps memory bandwidth below the value specified in the
schemata file.
rdtgroup_mba_mbps_event_{show,write}() hard codes the two legacy MBM
events, MBM total bytes and MBM local bytes.
(Needs explanation why the hardcoding is a problem and how this is fixed.
Since this series is not adding any new MBM events it is unclear what
motivates this "in the future" fix as part of this telemetry work.)
Reinette
* Re: [PATCH v4 04/31] fs/resctrl: Change how and when events are initialized
2025-04-29 0:33 ` [PATCH v4 04/31] fs/resctrl: Change how and when events are initialized Tony Luck
@ 2025-05-08 3:31 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:31 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> Existing code assumes that all monitor events are associated with
> the RDT_RESOURCE_L3 resource. Also that all event enumeration is
> complete during early resctrl initialization. Neither of these
> assumptions remain true for new events.
>
> Each resource must include a list of enabled events that is used
> to add appropriately named files when creating mon_data directories
> and to for the contents of "info/{resource}_MON/mon_features" file.
>
> Move the building of enabled event lists for each resource from
> resctrl_mon_resource_init() to rdt_get_tree() to delay it until
> mount of the resctrl file system.
>
> Add a new field to struct mon_evt to record which resource each
> event is associated with so that events are added to the correct
> resource event list.
>
As mentioned in comments to patch #1 I think the array it introduces
can be used instead of rdt_resource::evt_list, simplifying this significantly.
Reinette
* Re: [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events
2025-04-29 0:33 ` [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
@ 2025-05-08 3:32 ` Reinette Chatre
2025-05-10 9:58 ` Chen, Yu C
1 sibling, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:32 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> Intel RMID based telemetry events are counted by each CPU core
> and then aggregated by one or more per-socket micro controllers.
> Enumeration support is provided by the Intel PMT subsystem.
>
> N.B. Patches for the Intel PMT system are still in progress.
> They will define an INTEL_PMT_DISCOVERY Kconfig symbol that
> will be one of the dependencies. This is commented out for
> now. Final version will include this dependency.
>
> arch/x86 selects this option based on:
Portion of changelog seems to be missing ... what does "this option" refer to?
>
> X86_64: Counter registers are in MMIO space. There is no readq()
> function on 32-bit. Emulation is possible with readl(), but there
> are races. Running 32-bit kernels on systems that support this
> feature seems pointless.
This seems to be the primary dependency that requires the use of
a Kconfig symbol.
>
> CPU_SUP_INTEL: It is an Intel specific feature.
While this is an Intel specific feature it looks to me as though
intel_pmt_get_regions_by_feature() will return -ENODEV if it is not supported.
I think there is benefit to having this code compiled as much as
possible. An alternative to this CPU_SUP_INTEL dependency may be to add a
vendor check in the pre-mount callback to match the other Intel vs AMD,
but this may not be necessary?
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/Kconfig | 1 +
> drivers/platform/x86/intel/pmt/Kconfig | 7 +++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5a09acf41c8e..19107fdb4264 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -508,6 +508,7 @@ config X86_CPU_RESCTRL
> bool "x86 CPU resource control support"
> depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> depends on MISC_FILESYSTEMS
> + select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL)
> select ARCH_HAS_CPU_RESCTRL
> select RESCTRL_FS
> select RESCTRL_FS_PSEUDO_LOCK
> diff --git a/drivers/platform/x86/intel/pmt/Kconfig b/drivers/platform/x86/intel/pmt/Kconfig
> index e916fc966221..3a8ce39d1004 100644
> --- a/drivers/platform/x86/intel/pmt/Kconfig
> +++ b/drivers/platform/x86/intel/pmt/Kconfig
> @@ -38,3 +38,10 @@ config INTEL_PMT_CRASHLOG
>
> To compile this driver as a module, choose M here: the module
> will be called intel_pmt_crashlog.
> +
> +config INTEL_AET_RESCTRL
> + depends on INTEL_PMT_TELEMETRY # && INTEL_PMT_DISCOVERY
> + bool
> + help
> + Architecture config should "select" this option to enable
> + support for RMID telemetry events in the resctrl file system.
Reinette
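
The race mentioned for readl() based emulation is the usual torn-read problem: the upper half can change between the two 32-bit reads. One generic mitigation for monotonically increasing counters is a high/low/high retry loop, sketched below in userspace C. This is not what the patch does (it requires X86_64 instead); "demo_read32()" is a hypothetical stand-in for readl(), and the low word is assumed to sit at the lower address.

```c
#include <stdint.h>

/* Hypothetical 32-bit accessor standing in for readl(). */
static uint32_t demo_read32(const volatile uint32_t *addr)
{
	return *addr;
}

/*
 * Read a 64-bit counter as two 32-bit halves. If the high half changed
 * while the low half was being read, the low half may belong to either
 * value, so retry until the high half is stable.
 */
static uint64_t demo_read64_hi_lo_hi(const volatile uint32_t *regs)
{
	uint32_t hi, lo, hi2;

	do {
		hi = demo_read32(&regs[1]);
		lo = demo_read32(&regs[0]);
		hi2 = demo_read32(&regs[1]);
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}
```

Even with the retry loop, each read costs at least three accesses, which supports the changelog's point that 32-bit support is not worth the effort here.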
* Re: [PATCH v4 06/31] x86/rectrl: Fake OOBMSM interface
2025-04-29 0:33 ` [PATCH v4 06/31] x86/rectrl: Fake OOBMSM interface Tony Luck
2025-04-30 23:02 ` Luck, Tony
@ 2025-05-08 3:33 ` Reinette Chatre
1 sibling, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:33 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
> new file mode 100644
> index 000000000000..22b7c02a538c
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
> @@ -0,0 +1,95 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/cleanup.h>
> +#include <linux/minmax.h>
> +#include <linux/slab.h>
> +#include "fake_intel_aet_features.h"
> +#include <linux/intel_vsec.h>
> +#include <linux/resctrl.h>
> +
> +#include "internal.h"
> +
> +/*
> + * Amount of memory for each fake MMIO space
> + * Magic numbers here match values for XML ID 0x26696143 and 0x26557651
> + * 576: Number of RMIDs
> + * 2: Energy events in 0x26557651
> + * 7: Perf events in 0x26696143
> + * 3: Qwords for status counters after the event counters
> + * 8: Bytes for each counter
> + */
Thanks for adding the explanations. This does not answer the question from
https://lore.kernel.org/lkml/2897fc2a-8977-4415-ae6d-bd0002874b3a@intel.com/
though.
It looks like this sample interface is created to present the scenario where
the energy events do not have sufficient "counters" to support the number of
RMIDs in the MMIO space? It would make this work much easier to review if
these quirks were documented, or at least if the questions were answered during review.
Reinette
* Re: [PATCH v4 07/31] x86,fs/resctrl: Improve domain type checking
2025-04-29 0:33 ` [PATCH v4 07/31] x86,fs/resctrl: Improve domain type checking Tony Luck
@ 2025-05-08 3:36 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:36 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
>
> +static inline bool check_domain_header(struct rdt_domain_hdr *hdr,
> + enum resctrl_domain_type type,
> + enum resctrl_res_level rid)
> +{
> + return !!WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
> +}
> +
Please name the function to make the resulting code easier to read. With
a name like "check_domain_header()" it is not obvious what a return of "true"
or "false" means. In this implementation "true" means that the header is
*invalid* so a "pass" of this check means that the header is invalid? This
sounds very confusing to me.
Names like "domain_header_valid()", or "domain_header_is_valid()" make
the resulting code much easier to understand. Please do make it a goal
to make this work easy to understand; this series is complex.
Reinette
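
A positively named helper along the lines suggested above could look like the following userspace sketch. The "demo_" types are hypothetical stand-ins for the kernel's domain header and enums, and the WARN_ON_ONCE() that the patch wraps around the comparison is left out, so this version only reports validity.

```c
#include <stdbool.h>

enum demo_domain_type { DEMO_CTRL_DOMAIN, DEMO_MON_DOMAIN };
enum demo_res_level { DEMO_RES_L3, DEMO_RES_PERF_PKG };

struct demo_domain_hdr {
	enum demo_domain_type type;
	enum demo_res_level rid;
};

/* Returns true when the header matches the expected domain type and resource. */
static bool demo_domain_header_is_valid(const struct demo_domain_hdr *hdr,
					enum demo_domain_type type,
					enum demo_res_level rid)
{
	return hdr->type == type && hdr->rid == rid;
}
```

With this sense, a caller reads naturally as "if the header is not valid, bail out", rather than "if the check passes, bail out".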
* Re: [PATCH v4 08/31] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
2025-04-29 0:33 ` [PATCH v4 08/31] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
@ 2025-05-08 3:37 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:37 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> To prepare for additional types of monitoring domains, move all the
> L3 specific initialization into a helper function.
Please make this specific. "L3 specific initialization" covers quite a
bit ... from the L3 resource control and monitoring enumeration done in
arch code to the per domain initialization done during CPU online for
control and monitoring. Small change like "L3 resource monitoring domain
initialization" would already be more descriptive, but please feel free
to improve.
>
> Rename several functions to mark that they are specific to the L3 path.
>
> arch_mon_domain_online -> arch_l3_mon_domain_online
> mon_domain_free -> free_l3_mon_domain
Wouldn't "l3_mon_domain_free()" be a better match for the "online" variant?
I think it will be helpful to the reviewer to mention that the new
"helper function" is named setup_l3_mon_domain() (l3_mon_domain_setup()?)
and is the partner of the renamed "free" function.
> arch_mon_domain_online -> arch_l3_mon_domain_online
duplicate
> domain_setup_mon_state -> domain_setup_l3_mon_state
See "Function references in changelogs" in
Documentation/process/maintainer-tip.rst
>
> resctrl_online_mon_domain() is going to share some code with new
> reources, so keeps the same name, but include a check for
reources -> resources
keeps -> keep
> RDT_RESOURCE_L3.
Reinette
* Re: [PATCH v4 09/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-04-29 0:33 ` [PATCH v4 09/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-05-08 3:37 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:37 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> The RDT_RESOURCE_L3 resource carries a lot of state in the domain
> structures which needs to be dealt with when a domain is taken offline
> by removing the last CPU in the domain.
>
> Refactor so all the L3 processing is separated from general actions of
> clearing the CPU bit in the mask and removing directories from mon_data.
Why? Always please follow the tip guidance wrt changelogs: Context, Problem, Solution.
Reinette
* Re: [PATCH v4 10/31] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr
2025-04-29 0:33 ` [PATCH v4 10/31] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
@ 2025-05-08 3:38 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:38 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> Functions that don't need the internal details of the rdt_mon_domain
> can operate on just the rdt_domain_hdr.
Please add a bit more detail in the context:
"just the rdt_domain_hdr" -> "just the rdt_domain_hdr within" or "just rdt_mon_domain::rdt_domain_hdr".
>
> Add sanity checks where container_of() is used to find the surrounding
> domain structure that hdr has the expected type.
How is reader expected to interpret "hdr"?
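To make the question concrete, here is a rough userspace sketch of the kind of sanity check the changelog describes: verify the header's type and resource before using container_of() to reach the surrounding structure. All sketch_* names and the simplified fields are hypothetical and are not taken from the patch.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified versions of the domain types. */
enum sketch_domain_type { SKETCH_CTRL_DOMAIN, SKETCH_MON_DOMAIN };

struct sketch_domain_hdr {
	enum sketch_domain_type type;
	int rid;	/* resource the domain belongs to */
	int id;		/* domain id */
};

struct sketch_l3_mon_domain {
	struct sketch_domain_hdr hdr;
	int mbm_work_cpu;
};

/*
 * Return the surrounding L3 monitor domain, or NULL when the header
 * does not describe a monitor domain of the expected resource.
 */
static struct sketch_l3_mon_domain *
sketch_to_l3_mon_domain(struct sketch_domain_hdr *hdr, int expected_rid)
{
	if (!hdr || hdr->type != SKETCH_MON_DOMAIN || hdr->rid != expected_rid)
		return NULL;

	return container_of(hdr, struct sketch_l3_mon_domain, hdr);
}
```

With a check like this in one helper, callers that only touch hdr->cpu_mask or hdr->list never need the container_of() step at all.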
>
> Simplify code that uses "d->hdr." to "hdr->" where possible.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 4 +-
> arch/x86/kernel/cpu/resctrl/core.c | 19 +++----
> fs/resctrl/rdtgroup.c | 82 +++++++++++++++++++++---------
> 3 files changed, 68 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index e700f58b5af5..bb55c449adc4 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -444,9 +444,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type type);
> int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_online_cpu(unsigned int cpu);
> void resctrl_offline_cpu(unsigned int cpu);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 525439029865..9c78828ae32f 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -460,7 +460,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> return;
> d = container_of(hdr, struct rdt_ctrl_domain, hdr);
Is the above container_of() still needed?
>
> - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + cpumask_set_cpu(cpu, &hdr->cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> return;
> @@ -524,7 +524,7 @@ static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct
>
> list_add_tail_rcu(&d->hdr.list, add_pos);
>
> - err = resctrl_online_mon_domain(r, d);
> + err = resctrl_online_mon_domain(r, &d->hdr);
> if (err) {
> list_del_rcu(&d->hdr.list);
> synchronize_rcu();
> @@ -537,7 +537,6 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> struct list_head *add_pos = NULL;
> struct rdt_domain_hdr *hdr;
> - struct rdt_mon_domain *d;
>
> lockdep_assert_held(&domain_list_lock);
>
> @@ -549,11 +548,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>
> hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
> if (hdr) {
> - if (check_domain_header(hdr, RESCTRL_MON_DOMAIN, r->rid))
> - return;
> - d = container_of(hdr, struct rdt_mon_domain, hdr);
> -
> - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
It is not clear to me why changes to domain_add_cpu_ctrl() and domain_add_cpu_mon()
do not match in this regard.
> + cpumask_set_cpu(cpu, &hdr->cpu_mask);
> return;
> }
>
> @@ -603,9 +598,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> hw_dom = resctrl_to_arch_ctrl_dom(d);
>
> cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
The above also seems to be a candidate for the intended simplification?
> - if (cpumask_empty(&d->hdr.cpu_mask)) {
> + if (cpumask_empty(&hdr->cpu_mask)) {
> resctrl_offline_ctrl_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> + list_del_rcu(&hdr->list);
> synchronize_rcu();
>
> /*
Reinette
* Re: [PATCH v4 11/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-04-29 0:33 ` [PATCH v4 11/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-05-08 3:39 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:39 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> These structures have generic names, but are only used for L3 monitor
> events.
>
> Rename:
Please add information about why the rename is needed.
> rdt_mon_domain -> rdt_l3_mon_domain
> rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 12 ++++----
> arch/x86/kernel/cpu/resctrl/internal.h | 12 ++++----
> fs/resctrl/internal.h | 12 ++++----
> arch/x86/kernel/cpu/resctrl/core.c | 14 ++++-----
> arch/x86/kernel/cpu/resctrl/monitor.c | 18 ++++++------
> fs/resctrl/ctrlmondata.c | 6 ++--
> fs/resctrl/monitor.c | 28 +++++++++---------
> fs/resctrl/rdtgroup.c | 40 +++++++++++++-------------
> 8 files changed, 71 insertions(+), 71 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index bb55c449adc4..cd7881313d4e 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -166,7 +166,7 @@ struct rdt_ctrl_domain {
> };
>
> /**
> - * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
> + * struct rdt_l3_mon_domain - group of CPUs sharing a resctrl monitor resource
> * @hdr: common header for different domain types
> * @ci: cache info for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> @@ -176,7 +176,7 @@ struct rdt_ctrl_domain {
> * @mbm_work_cpu: worker CPU for MBM h/w counters
> * @cqm_work_cpu: worker CPU for CQM h/w counters
> */
> -struct rdt_mon_domain {
> +struct rdt_l3_mon_domain {
> struct rdt_domain_hdr hdr;
> struct cacheinfo *ci;
> unsigned long *rmid_busy_llc;
> @@ -335,7 +335,7 @@ struct resctrl_cpu_defaults {
>
> struct resctrl_mon_config_info {
> struct rdt_resource *r;
> - struct rdt_mon_domain *d;
> + struct rdt_l3_mon_domain *d;
> u32 evtid;
> u32 mon_config;
Please adjust alignment of the other members to match.
> };
> @@ -475,7 +475,7 @@ void resctrl_offline_cpu(unsigned int cpu);
> * Return:
> * 0 on success, or -EIO, -EINVAL etc on error.
> */
> -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
> u32 closid, u32 rmid, enum resctrl_event_id eventid,
> u64 *val, void *arch_mon_ctx);
>
> @@ -522,7 +522,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
> *
> * This can be called from any CPU.
> */
> -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
> u32 closid, u32 rmid,
> enum resctrl_event_id eventid);
>
> @@ -535,7 +535,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> *
> * This can be called from any CPU.
> */
> -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
>
> /**
> * resctrl_arch_reset_all_ctrls() - Reset the control for each CLOSID to its
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index b563406b4996..83b20e6b25d7 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -51,15 +51,15 @@ struct rdt_hw_ctrl_domain {
> };
>
> /**
> - * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> + * struct rdt_hw_l3_mon_domain - Arch private attributes of a set of CPUs that share
> * a resource for a monitor function
> * @d_resctrl: Properties exposed to the resctrl file system
> * @arch_mbm_states: arch private state for each MBM event
> *
> * Members of this structure are accessed via helpers that provide abstraction.
> */
> -struct rdt_hw_mon_domain {
> - struct rdt_mon_domain d_resctrl;
> +struct rdt_hw_l3_mon_domain {
> + struct rdt_l3_mon_domain d_resctrl;
> struct arch_mbm_state *arch_mbm_states[QOS_NUM_MBM_EVENTS];
Please adjust alignment.
> };
>
> @@ -68,9 +68,9 @@ static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctr
> return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
> }
>
> -static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
> +static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3_mon_domain *r)
> {
> - return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
> + return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
> }
>
> /**
> @@ -122,7 +122,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
>
> extern struct rdt_hw_resource rdt_resources_all[];
>
> -void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
>
> /* CPUID.(EAX=10H, ECX=ResID=1).EAX */
> union cpuid_0x10_1_eax {
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index b69170760316..759768e2a2a8 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -129,7 +129,7 @@ struct mon_data {
> struct rmid_read {
> struct rdtgroup *rgrp;
> struct rdt_resource *r;
> - struct rdt_mon_domain *d;
> + struct rdt_l3_mon_domain *d;
> enum resctrl_event_id evtid;
> bool first;
> struct cacheinfo *ci;
Alignment. Please check all changed structures.
...
> @@ -618,9 +618,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> - struct rdt_hw_mon_domain *hw_dom;
> + struct rdt_hw_l3_mon_domain *hw_dom;
> struct rdt_domain_hdr *hdr;
> - struct rdt_mon_domain *d;
> + struct rdt_l3_mon_domain *d;
>
Please fix the reverse fir tree ordering. Looking further, there are multiple instances; please check
the entire patch.
Reinette
* Re: [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-29 0:33 ` [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
@ 2025-05-08 3:54 ` Reinette Chatre
2025-05-13 3:19 ` Chen, Yu C
1 sibling, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 3:54 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
How about: "fs/resctrl: Handle events that can be read from any CPU"
On 4/28/25 5:33 PM, Tony Luck wrote:
> Resctrl file system code was built with the assumption that monitor
> events can only be read from a CPU in the cpumast_t set for each
cpumast_t -> cpumask_t
> domain. This was true for x86 events accessed with an MSR interface,
> but may not be true for other access methods such as MMIO.
Please separate context and problem description into separate paragraphs.
>
> Add a flag to each instance of struct mon_evt that can be set by
> architecture code to indicate there is no restriction on which
> CPU can read the event counter.
>
> Change struct mon_data and struct rmid_read to have a pointer to
> the struct mon_evt instead of the event id.
>
> Add an extra argument to resctrl_enable_mon_event() so architecture
> code can indicate which events can be read on any CPU when enabling
> the event.
>
> Bypass all the smp_call*() code for events that can be read on any CPU
> and call mon_event_count() directly from mon_event_read().
>
> Skip checks in __mon_event_count() that the read is being done from
> a CPU in the correct domain or cache scope.
hmmm ... this patch is bundling quite a few changes. Most are related
to the stated goal but I think separating out the change to
struct mon_data and struct rmid_read to have a pointer to struct mon_evt
will make it easier to recognize what it takes to support the
stated requirement of needing to support this new style of events.
...
> @@ -570,7 +575,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> struct rdt_domain_hdr *hdr;
> struct rmid_read rr = {0};
> struct rdt_l3_mon_domain *d;
> - u32 resid, evtid, domid;
> + struct mon_evt *evt;
> + u32 resid, domid;
Please fix the reverse fir tree ordering.
> struct rdtgroup *rdtgrp;
> struct rdt_resource *r;
> struct mon_data *md;
> @@ -590,7 +596,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>
> resid = md->rid;
> domid = md->domid;
> - evtid = md->evtid;
> + evt = md->evt;
> r = resctrl_arch_get_resource(resid);
>
> if (md->sum) {
> @@ -604,7 +610,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> if (d->ci->id == domid) {
> rr.ci = d->ci;
> mon_event_read(&rr, r, NULL, rdtgrp,
> - &d->ci->shared_cpu_map, evtid, false);
> + &d->ci->shared_cpu_map, evt, false);
> goto checkresult;
> }
> }
> @@ -621,7 +627,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> goto out;
> }
> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
> + mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evt, false);
> }
>
> checkresult:
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 19cba29452b7..e903d3c076ee 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -365,19 +365,19 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> u64 tval = 0;
>
> if (rr->first) {
> - resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
> - m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
> + resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evt->evtid);
> + m = get_mbm_state(rr->d, closid, rmid, rr->evt->evtid);
> if (m)
> memset(m, 0, sizeof(struct mbm_state));
> return 0;
> }
>
> if (rr->d) {
> - /* Reading a single domain, must be on a CPU in that domain. */
> - if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
> + /* Reading a single domain, must usually be on a CPU in that domain. */
No need to use vague "usually" when it is very specific that it must be on a CPU in that
domain unless it is an event that can be read from any CPU.
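Stated as code, the rule is exact rather than "usually". A minimal userspace sketch, assuming a 64-bit mask as a stand-in for the domain's cpumask and hypothetical sketch_* names:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced form of the event descriptor. */
struct sketch_mon_evt {
	bool any_cpu;	/* event may be read from any CPU */
};

/*
 * Return 0 when @cpu is allowed to read the event, -EINVAL otherwise.
 * @domain_cpu_mask stands in for the domain's cpumask (bit N == CPU N).
 */
static int sketch_check_read_cpu(const struct sketch_mon_evt *evt,
				 unsigned int cpu, uint64_t domain_cpu_mask)
{
	bool in_domain = cpu < 64 && ((domain_cpu_mask >> cpu) & 1ULL);

	/* Must be on a CPU in the domain unless the event is readable anywhere. */
	if (!evt->any_cpu && !in_domain)
		return -EINVAL;

	return 0;
}
```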
> + if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
> return -EINVAL;
> rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
> - rr->evtid, &tval, rr->arch_mon_ctx);
> + rr->evt->evtid, &tval, rr->arch_mon_ctx);
> if (rr->err)
> return rr->err;
>
> @@ -387,7 +387,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> }
>
> /* Summing domains that share a cache, must be on a CPU for that cache. */
missed comment change
> - if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
> + if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
> return -EINVAL;
>
> /*
Reinette
* Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-04-29 0:33 ` [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats Tony Luck
@ 2025-05-08 15:49 ` Reinette Chatre
2025-05-08 20:28 ` Luck, Tony
0 siblings, 1 reply; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:49 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
shortlog nit: "fs/resctrl: Support additional monitor event display formats"
On 4/28/25 5:33 PM, Tony Luck wrote:
> Resctrl was written with the assumption that all monitor events
> can be displayed as unsigned decimal integers.
>
> Some telemetry events provide greater precision where architecture code
> uses a fixed point format with 18 binary places.
>
> Add a "display_format" field to struct mon_evt which can specify
> that the value for the event be displayed as an integer for legacy
> events, or as a floating point value with six decimal places converted
> from the fixed point format received from architecture code.
There was no discussion on this during the previous version.
While this version addresses the issue of architecture changing the
format it does not address the issue of how to handle different
architecture formats. With this change any architecture that may
want to support any of these events will be required to translate
whatever format it uses into the one Intel uses, to be translated
again into a format for user space. Do you think this is reasonable?
Alternatively, resctrl could add an additional file that contains the
format so that if an architecture in the future needs to present data
differently, an interface will exist to guide userspace how to parse it.
Creation of such user interface cannot be delayed until the time
it is needed since then these formats would be ABI.
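For reference, the conversion the changelog describes (a fixed point value with 18 binary places shown with six decimal places) can be sketched in userspace as below. This is only an illustration under the assumption that the value is an unsigned 64-bit quantity; the sketch_* names and the round-to-nearest choice are mine, not the patch's.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_FRAC_BITS 18	/* binary places in the fixed point value */

struct sketch_decimal {
	uint64_t whole;		/* integer part */
	uint32_t micros;	/* fractional part in millionths */
};

/* Split a fixed point value into integer part and millionths. */
static struct sketch_decimal sketch_split_fixed(uint64_t val)
{
	uint64_t frac = val & ((1ULL << SKETCH_FRAC_BITS) - 1);
	struct sketch_decimal d;

	d.whole = val >> SKETCH_FRAC_BITS;
	/*
	 * Scale to millionths, rounding to nearest. The largest fraction
	 * (2^18 - 1) maps to 999996, so rounding never carries into @whole.
	 */
	d.micros = (uint32_t)((frac * 1000000ULL +
			       (1ULL << (SKETCH_FRAC_BITS - 1))) >> SKETCH_FRAC_BITS);
	return d;
}

/* Format as "<whole>.<six digits>"; returns what snprintf() returns. */
static int sketch_format_fixed(uint64_t val, char *buf, size_t len)
{
	struct sketch_decimal d = sketch_split_fixed(val);

	return snprintf(buf, len, "%" PRIu64 ".%06" PRIu32, d.whole, d.micros);
}
```

Note that one unit in the last binary place is about 3.8 millionths, so six decimal places do not lose precision relative to the input.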
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl_types.h | 5 +++++
> fs/resctrl/internal.h | 2 ++
> fs/resctrl/ctrlmondata.c | 24 +++++++++++++++++++++++-
> fs/resctrl/monitor.c | 21 ++++++++++++---------
> 4 files changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index 5ef14a24008c..6245034f6c76 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
This needs to be internal to resctrl fs.
resctrl_types.h should only contain the types required in asm/resctrl.h
> @@ -50,4 +50,9 @@ enum resctrl_event_id {
> #define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> #define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
>
> +/* Event value display formats */
Please add details about what each format means (how it should
be interpreted).
Reinette
* Re: [PATCH v4 14/31] fs/resctrl: Add an architectural hook called for each mount
2025-04-29 0:33 ` [PATCH v4 14/31] fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-05-08 15:50 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:50 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -450,6 +450,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
> void resctrl_online_cpu(unsigned int cpu);
> void resctrl_offline_cpu(unsigned int cpu);
>
> +/*
> + * Architecture hook called for each attempted file system mount
End sentence with period.
> + * No locks are held.
> + */
> +void resctrl_arch_pre_mount(void);
> +
> /**
> * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
> * for this resource and domain.
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 58bc218070e2..2f3efc4b1816 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -707,6 +707,14 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
> return 0;
> }
>
> +void resctrl_arch_pre_mount(void)
> +{
> + static atomic_t only_once;
> +
> + if (atomic_cmpxchg(&only_once, 0, 1))
> + return;
> +}
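A userspace sketch of the same once-only guard written in the "try" style, assuming C11 atomic_compare_exchange_strong() as a stand-in for the kernel's atomic_try_cmpxchg() (both take a pointer to the expected old value and return true on success). The sketch_* names and the call counter are hypothetical and exist only so the behaviour can be observed.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int sketch_only_once;	/* 0 until the first caller wins */
static int sketch_setup_calls;		/* counts how often setup ran */

static void sketch_pre_mount(void)
{
	int expected = 0;

	/*
	 * Succeeds (returns true) only for the caller that changes the
	 * value from 0 to 1; every later caller returns early.
	 */
	if (!atomic_compare_exchange_strong(&sketch_only_once, &expected, 1))
		return;

	/* One-time setup would go here. */
	sketch_setup_calls++;
}
```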
As I understand it, atomic_try_cmpxchg() is preferred on x86. See
"CMPXCHG vs TRY_CMPXCHG" in Documentation/atomic_t.txt for reference.
Reinette
* Re: [PATCH v4 15/31] x86/resctrl: Add and initialize rdt_resource for package scope core monitor
2025-04-29 0:33 ` [PATCH v4 15/31] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
@ 2025-05-08 15:50 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:50 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> Counts for each Intel telemetry event are periodically sent to one or
> more aggregators on each package where accumulated totals are made
> available in MMIO registers.
>
> Add a new resource for monitoring these events with code to build
It is unnecessary to say that code is used to add a change.
> domains at the package granularity.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 11 +++++++++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 6f424fffa083..3ae50b947a99 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -53,6 +53,7 @@ enum resctrl_res_level {
> RDT_RESOURCE_L2,
> RDT_RESOURCE_MBA,
> RDT_RESOURCE_SMBA,
> + RDT_RESOURCE_PERF_PKG,
>
> /* Must be the last */
> RDT_NUM_RESOURCES,
> @@ -250,6 +251,7 @@ enum resctrl_scope {
> RESCTRL_L2_CACHE = 2,
> RESCTRL_L3_CACHE = 3,
> RESCTRL_L3_NODE,
> + RESCTRL_PACKAGE,
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 2f3efc4b1816..4d1556707c01 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -99,6 +99,15 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
> .schema_fmt = RESCTRL_SCHEMA_RANGE,
> },
> },
> + [RDT_RESOURCE_PERF_PKG] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_PERF_PKG,
https://lore.kernel.org/lkml/20250425173809.5529-20-james.morse@arm.com/
Reinette
* Re: [PATCH v4 16/31] x86/resctrl: Add first part of telemetry event enumeration
2025-04-29 0:33 ` [PATCH v4 16/31] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
@ 2025-05-08 15:53 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:53 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
After working with this series for a while and needing to jump between
patches to understand how things fit together I would like to please ask
that you replace these "Add <num> part ..." and "Final steps ..." shortlogs
with something descriptive that reflects what the patch actually does.
This will make it much easier to jump between patches while trying
to understand how this all fits together.
On 4/28/25 5:33 PM, Tony Luck wrote:
> The OOBMSM VSEC discovery driver enumerates many different types
> of telemetry resources. Resctrl is only interested in the ones
> that are tied to an RMID value in the IA32_PQR_ASSOC MSR.
"RMID value in the IA32_PQR_ASSOC MSR" -> "RMID value that can be used
in the IA32_PQR_ASSOC MSR"?
>
> Make a request for each of the FEATURE_PER_RMID_ENERGY_TELEM and
"for each of the" -> "for the"
> FEATURE_PER_RMID_PERF_TELEM feature groups and scan the list
> of known event groups for matching guid values.
This is the first (apart from fake driver) mention of guid and it
is mentioned in a way that assumes reader knows what it means and what
the significance is. Please add more context about what guid means/represents.
>
> Configuration to follow in subsequent patches.
Please avoid "subsequent patches" in changelog.
>
> Hold onto references to any pmt_feature_groups that resctrl
How is reader expected to know what a "pmt_feature_group" is?
This work relies on a new separate feature that introduces new
data structures and itself introduces several data structures.
Please help reader to understand how this all fits together.
> uses until resctrl exit.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 8 ++
> arch/x86/kernel/cpu/resctrl/core.c | 5 ++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 113 ++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/Makefile | 1 +
> 4 files changed, 127 insertions(+)
> create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 83b20e6b25d7..571db665eca6 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -167,4 +167,12 @@ void __init intel_rdt_mbm_apply_quirk(void);
>
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>
> +#ifdef CONFIG_INTEL_AET_RESCTRL
> +bool intel_aet_get_events(void);
> +void __exit intel_aet_exit(void);
> +#else
> +static inline bool intel_aet_get_events(void) { return false; }
> +static inline void intel_aet_exit(void) { };
Extra semicolon?
> +#endif
> +
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 4d1556707c01..0103f577e4ca 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -724,6 +724,9 @@ void resctrl_arch_pre_mount(void)
>
> if (atomic_cmpxchg(&only_once, 0, 1))
> return;
> +
> + if (!intel_aet_get_events())
> + return;
> }
>
> enum {
> @@ -1076,6 +1079,8 @@ late_initcall(resctrl_arch_late_init);
>
> static void __exit resctrl_arch_exit(void)
> {
> + intel_aet_exit();
> +
> cpuhp_remove_state(rdt_online);
>
> resctrl_exit();
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> new file mode 100644
> index 000000000000..dda44baf75ae
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -0,0 +1,113 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Resource Director Technology(RDT)
> + * - Intel Application Energy Telemetry
> + *
> + * Copyright (C) 2025 Intel Corporation
> + *
> + * Author:
> + * Tony Luck <tony.luck@intel.com>
> + */
> +
> +#define pr_fmt(fmt) "resctrl: " fmt
> +
> +#include <linux/cleanup.h>
> +#include <linux/cpu.h>
> +#include <linux/resctrl.h>
> +
> +/* Temporary - delete from final version */
> +#include "fake_intel_aet_features.h"
> +
> +#include "internal.h"
> +
> +/**
> + * struct event_group - All information about a group of telemetry events.
> + * Some fields initialized with MMIO layout information
> + * gleaned from the XML files. Others are set from data
> + * retrieved from intel_pmt_get_regions_by_feature().
Please see "Structure, union, and enumeration documentation" in
Documentation/doc-guide/kernel-doc.rst on how a "brief description" is
separate from the full description.
The "some" and "others" is quite vague. The members themselves
can have snippet to indicate where it is initialized from.
> + * @pfg: The pmt_feature_group for this event group
This comment does not add any information.
"Points to the aggregated telemetry space information within the OOBMSM driver
that contains data for all telemetry regions ..."
> + * @guid: Unique number per XML description file
How can we be sure that it is clear to the reader what "XML description file"
means here? Which XML file? The structure description can expand what is meant
by "the XML files" and then each member initialized from it can have a
"(initialized from XML file)" or "(from XML file)" snippet.
> + */
> +struct event_group {
> + struct pmt_feature_group *pfg;
> + int guid;
Should this be u32?
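For illustration, a kernel-doc layout with the brief description kept separate from the longer description might look like the sketch below. The struct name, the opaque stand-in for the pmt type, and all comment wording are placeholders rather than text from the patch; uint32_t stands in for u32.

```c
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-in for the OOBMSM driver's feature group type. */
struct sketch_pmt_feature_group;

/**
 * struct sketch_event_group - Information about a group of telemetry events.
 * @pfg:	Feature group obtained from the discovery driver; NULL until
 *		the group has been matched (set at enumeration time).
 * @guid:	Unique number identifying the event layout (from XML file).
 *
 * Longer description goes here, after a blank comment line: for example
 * which XML files are meant and how members are initialized.
 */
struct sketch_event_group {
	struct sketch_pmt_feature_group	*pfg;
	uint32_t			guid;
};
```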
> +};
> +
> +/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
This link returns a 404.
> +static struct event_group energy_0x26696143 = {
> + .guid = 0x26696143,
> +};
> +
> +/* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
This link returns a 404.
> +static struct event_group perf_0x26557651 = {
> + .guid = 0x26557651,
> +};
> +
> +static struct event_group *known_event_groups[] = {
> + &energy_0x26696143,
> + &perf_0x26557651,
> +};
One has to study this series further to understand where this is going. At this point
the data structure seems to be unnecessarily complex requiring (what appears to be)
a lot of unnecessary pointer wrangling. A reader can ask here: why pointers, why
not just an array of structs? Adding information to explain
this choice will help to understand this work and make this easier to review.
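To show what the simpler alternative would look like, here is a userspace sketch of a plain array of structs searched by guid. It is only the shape being asked about, not a claim that it fits the later patches; the sketch_* names are hypothetical, and the two guid values are the ones from the quoted hunk.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, reduced event group: guid plus an opaque group pointer. */
struct sketch_group {
	uint32_t guid;
	void *pfg;
};

/* Array of structs instead of an array of pointers to structs. */
static struct sketch_group sketch_known_groups[] = {
	{ .guid = 0x26696143 },
	{ .guid = 0x26557651 },
};

/* Return the known group matching @guid, or NULL if there is none. */
static struct sketch_group *sketch_find_group(uint32_t guid)
{
	for (size_t i = 0; i < SKETCH_ARRAY_SIZE(sketch_known_groups); i++) {
		if (sketch_known_groups[i].guid == guid)
			return &sketch_known_groups[i];
	}

	return NULL;
}
```

Indexing with ARRAY_SIZE() also avoids the double-pointer loop and the (*peg)-> dereferences.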
> +
> +#define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
> +
> +static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
configure_events() is an unexpected name for a function that returns true/false.
There is no explanation for this and reader needs to read following patches to
understand what its purpose is.
> +{
> + return false;
> +}
> +
> +DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
> + if (!IS_ERR_OR_NULL(_T)) \
> + intel_pmt_put_feature_group(_T))
> +
Please document get_pmt_feature().
> +static bool get_pmt_feature(enum pmt_feature_id feature)
> +{
> + struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
> + struct event_group **peg;
> + bool ret;
> +
> + p = intel_pmt_get_regions_by_feature(feature);
> +
> + if (IS_ERR_OR_NULL(p))
> + return false;
> +
> + for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
> + for (int i = 0; i < p->count; i++) {
> + if ((*peg)->guid == p->regions[i].guid) {
> + ret = configure_events((*peg), p);
Unnecessary parentheses?
> + if (ret) {
This introduces too much confusion. It is customary for "0" to indicate success, but this uses
a boolean, which is an unclear choice for a function that turns out (looking a few patches ahead)
to be complex. Please change configure_events() to return proper error codes. For example, it
allocates memory, and when that fails it should return -ENOMEM, not "false".
> + (*peg)->pfg = no_free_ptr(p);
> + return true;
> + }
> + break;
> + }
> + }
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Ask OOBMSM discovery driver for all the RMID based telemetry groups
> + * that it supports.
This comment implies that get_pmt_feature() is an OOBMSM call?
> + */
> +bool intel_aet_get_events(void)
> +{
> + bool ret1, ret2;
> +
> + ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM);
> + ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM);
> +
> + return ret1 || ret2;
> +}
> +
> +void __exit intel_aet_exit(void)
> +{
> + struct event_group **peg;
> +
> + for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
> + if ((*peg)->pfg) {
> + intel_pmt_put_feature_group((*peg)->pfg);
> + (*peg)->pfg = NULL;
> + }
> + }
> +}
> diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
> index 28ae1c88b2ac..8b4603cad783 100644
> --- a/arch/x86/kernel/cpu/resctrl/Makefile
> +++ b/arch/x86/kernel/cpu/resctrl/Makefile
> @@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
> obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
> +obj-$(CONFIG_INTEL_AET_RESCTRL) += intel_aet.o
> obj-$(CONFIG_INTEL_AET_RESCTRL) += fake_intel_aet_features.o
> obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
>
Reinette
* Re: [PATCH v4 17/31] x86/resctrl: Add second part of telemetry event enumeration
2025-04-29 0:33 ` [PATCH v4 17/31] x86/resctrl: Add second " Tony Luck
@ 2025-05-08 15:54 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:54 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> There may be multiple telemetry aggregators per package, each enumerated
> by a telemetry region structure in the feature group.
>
> Scan the array of telemetry region structures and count how many are
> in each package in preparation to allocate structures to save the MMIO
> addresses for each in a convenient format for use when reading event
> counters.
Note that the reader does not know at this point whether the subsequent processing
will be done by further expanding configure_events() or via a new function
called by get_pmt_feature() after configure_events() completes. Without knowing this,
the patch looks buggy since it seems to forget to save a pointer to
initialized data.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index dda44baf75ae..a0365c3ce982 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -52,6 +52,27 @@ static struct event_group *known_event_groups[] = {
>
> static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
> {
> + int *pkgcounts __free(kfree) = NULL;
> + struct telemetry_region *tr;
> + int num_pkgs;
> +
> + num_pkgs = topology_max_packages();
> + pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
> + if (!pkgcounts)
> + return false;
-ENOMEM
> +
> + /* Get per-package counts of telemetry_region for this guid */
How should "telemetry_region" be interpreted? If this is intended to refer
to the individual structs then it should be "counts of telemetry_region structs";
if it is intended to refer to what the structs represent, it can be "telemetry
regions".
Also, it is not obvious what is meant by "this guid" ... there are two guids in
the snippet below.
> + for (int i = 0; i < p->count; i++) {
> + tr = &p->regions[i];
> + if (tr->guid != e->guid)
> + continue;
> + if (tr->plat_info.package_id >= num_pkgs) {
> + pr_warn_once("Bad package %d\n", tr->plat_info.package_id);
> + continue;
> + }
> + pkgcounts[tr->plat_info.package_id]++;
> + }
> +
> return false;
configure_events() returns false on both success and failure? Perhaps this is temporary
until all parsing has been implemented, but that is another thing the reader needs
to guess now or look up in later patches to understand.
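To illustrate the return-code convention asked for here, a minimal user-space sketch follows. It is not taken from the patch: struct region and count_regions() are hypothetical stand-ins for struct telemetry_region and the counting loop, and calloc() stands in for kcalloc().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct telemetry_region; not the kernel type. */
struct region {
	unsigned int guid;
	int package_id;
};

/*
 * Count the regions matching @guid on each package.
 * Returns 0 on success, -EINVAL on bad arguments, -ENOMEM on allocation
 * failure. On success *out points to a calloc()ed array of @num_pkgs
 * counts that the caller must free().
 */
static int count_regions(const struct region *regions, int nregions,
			 unsigned int guid, int num_pkgs, int **out)
{
	int *counts;

	if (!regions || !out || num_pkgs <= 0)
		return -EINVAL;

	counts = calloc(num_pkgs, sizeof(*counts));
	if (!counts)
		return -ENOMEM;

	for (int i = 0; i < nregions; i++) {
		if (regions[i].guid != guid)
			continue;
		if (regions[i].package_id < 0 || regions[i].package_id >= num_pkgs)
			continue;	/* skip regions that report a bad package */
		counts[regions[i].package_id]++;
	}

	*out = counts;
	return 0;
}
```

With this shape a caller can tell "nothing matched" (return 0, all counts zero) apart from "allocation failed" (return -ENOMEM), which a bool return cannot express.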
> }
>
Reinette
* Re: [PATCH v4 19/31] x86,fs/resctrl: Fill in details of Clearwater Forest events
2025-04-29 0:33 ` [PATCH v4 19/31] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
@ 2025-05-08 15:54 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:54 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> Clearwater Forest supports two energy related telemetry events
> and seven perf style events.
>
> Define these events in the file system code and add the events
> to the event_group structures.
>
> PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point
> format. File system code must output as floating point values.
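As a rough illustration of the fixed point conversion mentioned here: the sketch below assumes that EVT_FORMAT_U46_18 means an unsigned value with 18 binary fraction bits (an assumption drawn only from the name). Kernel code generally avoids floating point arithmetic, so the sketch sticks to integer math. format_u46_18() is a hypothetical helper, not a function from this series.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 18	/* assumed from the name EVT_FORMAT_U46_18 */

/*
 * Format an unsigned fixed-point value with FRAC_BITS binary fraction
 * bits as a decimal string with six digits after the point, using only
 * integer arithmetic. The fraction is truncated, not rounded.
 */
static int format_u46_18(uint64_t val, char *buf, size_t len)
{
	uint64_t whole = val >> FRAC_BITS;
	uint64_t frac = val & ((UINT64_C(1) << FRAC_BITS) - 1);

	/* frac < 2^18, so frac * 1000000 cannot overflow 64 bits. */
	frac = (frac * 1000000) >> FRAC_BITS;

	return snprintf(buf, len, "%" PRIu64 ".%06" PRIu64, whole, frac);
}
```

For example, a raw value with integer part 3 and only the top fraction bit set would be formatted as "3.500000".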
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl_types.h | 11 +++++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 31 ++++++++++++++
> fs/resctrl/monitor.c | 54 +++++++++++++++++++++++++
> 3 files changed, 96 insertions(+)
>
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index 6245034f6c76..39de5451cff8 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -43,6 +43,17 @@ enum resctrl_event_id {
> QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
> QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
>
> + /* Intel Telemetry Events */
> + PMT_EVENT_ENERGY,
> + PMT_EVENT_ACTIVITY,
> + PMT_EVENT_STALLS_LLC_HIT,
> + PMT_EVENT_C1_RES,
> + PMT_EVENT_UNHALTED_CORE_CYCLES,
> + PMT_EVENT_STALLS_LLC_MISS,
> + PMT_EVENT_AUTO_C6_RES,
> + PMT_EVENT_UNHALTED_REF_CYCLES,
> + PMT_EVENT_UOPS_RETIRED,
> +
> /* Must be the last */
> QOS_NUM_EVENTS,
> };
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 03839d5c369b..7e4f6a6672d4 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -30,6 +30,18 @@ struct mmio_info {
> void __iomem *addrs[] __counted_by(count);
> };
>
> +/**
> + * struct pmt_event - Telemetry event.
> + * @evtid: Resctrl event id
> + * @evt_idx: Counter index within each per-RMID block of counters
Where can the reader find details of what this "per-RMID block of counters"
looks like?
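For reference, my current understanding of the layout, pieced together from the changelog of the "Check for adequate MMIO space" patch and the index computation in intel_aet_read_event(), can be sketched as below. counter_offset() is a hypothetical helper written for illustration only; please correct it if the layout differs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Byte offset of one counter inside a telemetry region, assuming the
 * layout described elsewhere in this series: all events for RMID 0,
 * then all events for RMID 1, and so on, 8 bytes per counter.
 */
static size_t counter_offset(unsigned int rmid, unsigned int num_events,
			     unsigned int evt_idx)
{
	return ((size_t)rmid * num_events + evt_idx) * sizeof(uint64_t);
}
```

Under that assumption, @evt_idx selects a counter within the block of num_events counters that belongs to a single RMID.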
> + */
> +struct pmt_event {
> + enum resctrl_event_id evtid;
> + int evt_idx;
> +};
> +
> +#define EVT(id, idx) { .evtid = id, .evt_idx = idx }
> +
> /**
> * struct event_group - All information about a group of telemetry events.
> * Some fields initialized with MMIO layout information
> @@ -38,21 +50,40 @@ struct mmio_info {
> * @pfg: The pmt_feature_group for this event group
> * @guid: Unique number per XML description file
> * @pkginfo: Per-package MMIO addresses
> + * @num_events: Number of events in this group
Can append (initialized from XML file) or just (from XML file).
> + * @evts: Array of event descriptors
Can append (initialized from XML file) or just (from XML file).
> */
> struct event_group {
> struct pmt_feature_group *pfg;
> int guid;
> struct mmio_info **pkginfo;
> + int num_events;
> + struct pmt_event evts[] __counted_by(num_events);
> };
>
> /* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
> static struct event_group energy_0x26696143 = {
> .guid = 0x26696143,
> + .num_events = 2,
> + .evts = {
> + EVT(PMT_EVENT_ENERGY, 0),
> + EVT(PMT_EVENT_ACTIVITY, 1),
> + }
> };
>
> /* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
> static struct event_group perf_0x26557651 = {
> .guid = 0x26557651,
> + .num_events = 7,
> + .evts = {
> + EVT(PMT_EVENT_STALLS_LLC_HIT, 0),
> + EVT(PMT_EVENT_C1_RES, 1),
> + EVT(PMT_EVENT_UNHALTED_CORE_CYCLES, 2),
> + EVT(PMT_EVENT_STALLS_LLC_MISS, 3),
> + EVT(PMT_EVENT_AUTO_C6_RES, 4),
> + EVT(PMT_EVENT_UNHALTED_REF_CYCLES, 5),
> + EVT(PMT_EVENT_UOPS_RETIRED, 6),
> + }
> };
>
> static struct event_group *known_event_groups[] = {
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index be78488a15e5..f848325591b4 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -861,6 +861,60 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
> .rid = RDT_RESOURCE_L3,
> .display_format = EVT_FORMAT_U64,
> },
> + [PMT_EVENT_ENERGY] = {
> + .name = "core_energy",
> + .evtid = PMT_EVENT_ENERGY,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U46_18,
> + },
> + [PMT_EVENT_ACTIVITY] = {
> + .name = "activity",
> + .evtid = PMT_EVENT_ACTIVITY,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U46_18,
> + },
> + [PMT_EVENT_STALLS_LLC_HIT] = {
> + .name = "stalls_llc_hit",
> + .evtid = PMT_EVENT_STALLS_LLC_HIT,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U64,
> + },
> + [PMT_EVENT_C1_RES] = {
> + .name = "c1_res",
> + .evtid = PMT_EVENT_C1_RES,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U64,
> + },
> + [PMT_EVENT_UNHALTED_CORE_CYCLES] = {
> + .name = "unhalted_core_cycles",
> + .evtid = PMT_EVENT_UNHALTED_CORE_CYCLES,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U64,
> + },
> + [PMT_EVENT_STALLS_LLC_MISS] = {
> + .name = "stalls_llc_miss",
> + .evtid = PMT_EVENT_STALLS_LLC_MISS,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U64,
> + },
> + [PMT_EVENT_AUTO_C6_RES] = {
> + .name = "c6_res",
> + .evtid = PMT_EVENT_AUTO_C6_RES,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U64,
> + },
> + [PMT_EVENT_UNHALTED_REF_CYCLES] = {
> + .name = "unhalted_ref_cycles",
> + .evtid = PMT_EVENT_UNHALTED_REF_CYCLES,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U64,
> + },
> + [PMT_EVENT_UOPS_RETIRED] = {
> + .name = "uops_retired",
> + .evtid = PMT_EVENT_UOPS_RETIRED,
> + .rid = RDT_RESOURCE_PERF_PKG,
> + .display_format = EVT_FORMAT_U64,
> + },
> };
>
> void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
Reinette
* Re: [PATCH v4 18/31] x86/resctrl: Add third part of telemetry event enumeration
2025-04-29 0:33 ` [PATCH v4 18/31] x86/resctrl: Add third " Tony Luck
@ 2025-05-08 15:56 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:56 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> Counters for telemetry events are in MMIO space. Each telemetry_region
> structure returned in the pmt_feature_group returned from OOBMSM
> contains the base MMIO address for the counters.
>
> Scan all the telemetry_region structures again and gather these
> addresses into a more convenient structure with addresses for
> each aggregator indexed by package id. Note that there may be
> multiple aggregators per package.
Could this series please provide a clear definition for "telemetry
region" and "aggregator" and then use the terms consistently?
I find that the comments switch between the two, which causes confusion.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 55 +++++++++++++++++++++++++
> 1 file changed, 55 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index a0365c3ce982..03839d5c369b 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -20,6 +20,16 @@
>
> #include "internal.h"
>
> +/**
> + * struct mmio_info - Array of MMIO addresses for a package
Please add a description of how this structure is used. Please use
these docs to help readers create a mental model of how these
data structures fit together.
I am making an attempt at an example below, but I am still trying to
understand how things fit together, so I would appreciate it if you
wrote this instead. (Please keep this in mind when viewing any of the
samples I provide.)
Example,
Array of MMIO addresses of one event group for a package.
Provides convenient access to all MMIO addresses of
one event group for one package. Used when reading
event data on a package. (needs improvement)
> + * @count: Number of addresses on this package
Any information on what this number means? For example,
"Number of telemetry regions of a specific event group."
> + * @addrs: The MMIO addresses
Can the layout of MMIO space be added to the comments?
> + */
> +struct mmio_info {
> + int count;
> + void __iomem *addrs[] __counted_by(count);
> +};
> +
> /**
> * struct event_group - All information about a group of telemetry events.
> * Some fields initialized with MMIO layout information
> @@ -27,10 +37,12 @@
> * retrieved from intel_pmt_get_regions_by_feature().
> * @pfg: The pmt_feature_group for this event group
> * @guid: Unique number per XML description file
> + * @pkginfo: Per-package MMIO addresses
"Per-package MMIO addresses of telemetry regions belonging to this group."?
> */
> struct event_group {
> struct pmt_feature_group *pfg;
> int guid;
> + struct mmio_info **pkginfo;
> };
>
> /* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
> @@ -50,12 +62,33 @@ static struct event_group *known_event_groups[] = {
>
> #define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
>
> +static void free_mmio_info(struct mmio_info **mmi)
> +{
> + int num_pkgs = topology_max_packages();
> +
> + if (!mmi)
> + return;
> +
> + for (int i = 0; i < num_pkgs; i++)
> + kfree(mmi[i]);
> + kfree(mmi);
> +}
> +
> +DEFINE_FREE(mmio_info, struct mmio_info **, free_mmio_info(_T))
> +
> static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
> {
> + struct mmio_info __free(mmio_info) **pkginfo = NULL;
> int *pkgcounts __free(kfree) = NULL;
> struct telemetry_region *tr;
> + struct mmio_info *mmi;
> int num_pkgs;
>
> + if (e->pkginfo) {
> + pr_warn("Duplicate telemetry information for guid 0x%x\n", e->guid);
> + return false;
> + }
> +
> num_pkgs = topology_max_packages();
> pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
> if (!pkgcounts)
> @@ -73,6 +106,27 @@ static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
> pkgcounts[tr->plat_info.package_id]++;
> }
>
> + /* Allocate per-package arrays and save MMIO addresses */
per-package arrays of what?
> + pkginfo = kcalloc(num_pkgs, sizeof(*pkginfo), GFP_KERNEL);
> + if (!pkginfo)
> + return false;
-ENOMEM
> + for (int i = 0; i < num_pkgs; i++) {
> + pkginfo[i] = kmalloc(struct_size(pkginfo[i], addrs, pkgcounts[i]), GFP_KERNEL);
kzalloc()
> + if (!pkginfo[i])
> + return false;
-ENOMEM
> + pkginfo[i]->count = pkgcounts[i];
> + }
> +
> + /* Save MMIO address(es) for each aggregator in per-package structures */
Should "aggregator" be "telemetry region"? It is becoming confusing what "aggregator"
vs "telemetry region" represents here.
> + for (int i = 0; i < p->count; i++) {
> + tr = &p->regions[i];
> + if (tr->guid != e->guid || tr->plat_info.package_id >= num_pkgs)
> + continue;
> + mmi = pkginfo[tr->plat_info.package_id];
> + mmi->addrs[--pkgcounts[tr->plat_info.package_id]] = tr->addr;
For this code to be safe the "if()" checks that precede it must match *exactly*
with the checks used to initialize the pkgcounts array. To ensure this remains the
case I think those checks need to be placed in a function to be called in both
places.
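A minimal user-space sketch of that idea follows. All names (struct region, struct group, region_usable(), count_pass(), fill_pass()) are hypothetical simplifications rather than code from the series, and the size comparison anticipates the check that a later patch in the series adds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct region {
	unsigned int guid;
	int package_id;
	size_t size;
};

struct group {
	unsigned int guid;
	size_t mmio_size;
};

/* Single source of truth for "does this region belong to this group?" */
static bool region_usable(const struct group *g, const struct region *r,
			  int num_pkgs)
{
	if (r->guid != g->guid)
		return false;
	if (r->package_id < 0 || r->package_id >= num_pkgs)
		return false;
	if (r->size < g->mmio_size)
		return false;
	return true;
}

/* Pass 1: count usable regions per package. */
static void count_pass(const struct group *g, const struct region *r, int n,
		       int num_pkgs, int *counts)
{
	for (int i = 0; i < n; i++)
		if (region_usable(g, &r[i], num_pkgs))
			counts[r[i].package_id]++;
}

/*
 * Pass 2: record the index of each usable region. slots[p] must have
 * room for the counts[p] entries computed by count_pass().
 */
static void fill_pass(const struct group *g, const struct region *r, int n,
		      int num_pkgs, int *counts, int **slots)
{
	for (int i = 0; i < n; i++)
		if (region_usable(g, &r[i], num_pkgs))
			slots[r[i].package_id][--counts[r[i].package_id]] = i;
}
```

Because both passes call the same predicate, adding a new rejection reason later cannot make the second pass decrement a count that the first pass never incremented.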
> + }
> + e->pkginfo = no_free_ptr(pkginfo);
> +
> return false;
> }
>
> @@ -130,5 +184,6 @@ void __exit intel_aet_exit(void)
> intel_pmt_put_feature_group((*peg)->pfg);
> (*peg)->pfg = NULL;
> }
> + free_mmio_info((*peg)->pkginfo);
> }
> }
Reinette
* Re: [PATCH v4 20/31] x86/resctrl: Check for adequate MMIO space
2025-04-29 0:33 ` [PATCH v4 20/31] x86/resctrl: Check for adequate MMIO space Tony Luck
@ 2025-05-08 15:56 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:56 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> The MMIO space for each telemetry aggregator is arranged as a array of
"a array" -> "an array"
> count registers for each event for RMID 0, followed by RMID 1, and so on.
"count registers" -> "counter registers"?
"followed by RMID 1" -> "followed by each event for RMID 1"?
> After all event counters there are three status registers. All registers
> are 8 bytes each.
>
> The total size of MMIO space as described by the XML files is thus:
>
> (NUM_RMIDS * NUM_COUNTERS + 3) * 8
>
> Add an "mmio_size" field to the event_group structure and a sanity
> check that the size reported in the telemetry_region structure obtained
> from intel_pmt_get_regions_by_feature() is as large as expected.
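As a worked example of the formula above, assuming the 576 RMIDs that the event group definitions in this patch use: the energy group (2 counters) needs (576 * 2 + 3) * 8 = 9240 bytes and the perf group (7 counters) needs (576 * 7 + 3) * 8 = 32280 bytes. The helper below is a hypothetical illustration, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_STATUS_REGS 3	/* trailing status registers, per the changelog */

/* Expected MMIO size in bytes: (NUM_RMIDS * NUM_COUNTERS + 3) * 8 */
static size_t expected_mmio_size(size_t num_rmids, size_t num_counters)
{
	return (num_rmids * num_counters + NUM_STATUS_REGS) * sizeof(uint64_t);
}
```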
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 7e4f6a6672d4..37dd493df250 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -49,6 +49,7 @@ struct pmt_event {
> * retrieved from intel_pmt_get_regions_by_feature().
> * @pfg: The pmt_feature_group for this event group
> * @guid: Unique number per XML description file
> + * @mmio_size: Number of bytes of mmio registers for this group
mmio -> MMIO
Can append "(from XML file)".
> * @pkginfo: Per-package MMIO addresses
> * @num_events: Number of events in this group
> * @evts: Array of event descriptors
> @@ -56,6 +57,7 @@ struct pmt_event {
> struct event_group {
> struct pmt_feature_group *pfg;
> int guid;
> + int mmio_size;
Should this be size_t?
> struct mmio_info **pkginfo;
> int num_events;
> struct pmt_event evts[] __counted_by(num_events);
> @@ -64,6 +66,7 @@ struct event_group {
> /* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
> static struct event_group energy_0x26696143 = {
> .guid = 0x26696143,
> + .mmio_size = (576 * 2 + 3) * 8,
> .num_events = 2,
> .evts = {
> EVT(PMT_EVENT_ENERGY, 0),
> @@ -74,6 +77,7 @@ static struct event_group energy_0x26696143 = {
> /* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
> static struct event_group perf_0x26557651 = {
> .guid = 0x26557651,
> + .mmio_size = (576 * 7 + 3) * 8,
> .num_events = 7,
> .evts = {
> EVT(PMT_EVENT_STALLS_LLC_HIT, 0),
> @@ -134,6 +138,10 @@ static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
> pr_warn_once("Bad package %d\n", tr->plat_info.package_id);
> continue;
> }
> + if (tr->size < e->mmio_size) {
> + pr_warn_once("MMIO space too small for guid 0x%x\n", e->guid);
> + continue;
> + }
> pkgcounts[tr->plat_info.package_id]++;
> }
>
> @@ -151,7 +159,8 @@ static bool configure_events(struct event_group *e, struct pmt_feature_group *p)
> /* Save MMIO address(es) for each aggregator in per-package structures */
> for (int i = 0; i < p->count; i++) {
> tr = &p->regions[i];
> - if (tr->guid != e->guid || tr->plat_info.package_id >= num_pkgs)
> + if (tr->guid != e->guid || tr->plat_info.package_id >= num_pkgs ||
> + tr->size < e->mmio_size)
> continue;
> mmi = pkginfo[tr->plat_info.package_id];
> mmi->addrs[--pkgcounts[tr->plat_info.package_id]] = tr->addr;
Reinette
* Re: [PATCH v4 21/31] x86/resctrl: Add fourth part of telemetry event enumeration
2025-04-29 0:33 ` [PATCH v4 21/31] x86/resctrl: Add fourth part of telemetry event enumeration Tony Luck
@ 2025-05-08 15:56 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:56 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> At run time when a user reads an event file the file system code
> provides the enum resctrl_event_id for the event.
>
> Create a lookup table indexed by event id to provide the telem_entry
> structure and the event index into MMIO space.
https://lore.kernel.org/lkml/7bb97892-16fd-49c5-90f0-223526ebdf4c@intel.com/
>
> Enable the events marked as readable from any CPU.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 21 ++++++++++++++++++++-
> 1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 37dd493df250..e1cb6bd4788d 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -97,6 +97,16 @@ static struct event_group *known_event_groups[] = {
>
> #define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
>
> +/**
> + * struct evtinfo - lookup table from resctrl_event_id to useful information
> + * @event_group: Pointer to the telem_entry structure for this event
What is telem_entry structure?
> + * @idx: Counter index within each per-RMID block of counters
> + */
> +static struct evtinfo {
> + struct event_group *event_group;
> + int idx;
> +} evtinfo[QOS_NUM_EVENTS];
> +
Reinette
* Re: [PATCH v4 22/31] x86/resctrl: Read core telemetry events
2025-04-29 0:33 ` [PATCH v4 22/31] x86/resctrl: Read core telemetry events Tony Luck
@ 2025-05-08 15:57 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:57 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> The resctrl file system passes requests to read event monitor files to
> the architecture resctrl_arch_rmid_read() function to collect values
nit: no need to say "function" when using ().
> from hardware counters.
>
> Use the resctrl resource to differentiate between calls to read legacy
> L3 events from the new telemetry events (which are attached to
> RDT_RESOURCE_PERF_PKG).
>
> There may be multiple devices tracking each package, so scan all of them
"devices" seems to be yet another term in the mix alongside "aggregator" and
"telemetry region". Having multiple terms for the same or similar thing is confusing.
> and add up all counters.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 5 ++++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 34 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 3 +++
> 3 files changed, 42 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 571db665eca6..dd5fe8a98304 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -170,9 +170,14 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
> #ifdef CONFIG_INTEL_AET_RESCTRL
> bool intel_aet_get_events(void);
> void __exit intel_aet_exit(void);
> +int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val);
> #else
> static inline bool intel_aet_get_events(void) { return false; }
> static inline void intel_aet_exit(void) { };
> +static inline int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val)
> +{
> + return -EINVAL;
> +}
> #endif
>
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index e1cb6bd4788d..0bbf991da981 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -13,6 +13,7 @@
>
> #include <linux/cleanup.h>
> #include <linux/cpu.h>
> +#include <linux/io.h>
> #include <linux/resctrl.h>
>
> /* Temporary - delete from final version */
> @@ -246,3 +247,36 @@ void __exit intel_aet_exit(void)
> free_mmio_info((*peg)->pkginfo);
> }
> }
> +
> +#define VALID_BIT BIT_ULL(63)
> +#define DATA_BITS GENMASK_ULL(62, 0)
> +
> +/*
> + * Read counter for an event on a domain (summing all aggregators
> + * on the domain).
> + */
> +int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val)
> +{
> + struct evtinfo *info = &evtinfo[evtid];
> + struct mmio_info *mmi;
> + u64 evtcount;
> + int idx;
> +
> + idx = rmid * info->event_group->num_events;
> + idx += info->idx;
> + mmi = info->event_group->pkginfo[domid];
> +
> + if (idx * sizeof(u64) > info->event_group->mmio_size) {
Reading offset "idx * sizeof(u64)" when
"idx * sizeof(u64) == info->event_group->mmio_size" is an out-of-bounds read, no?
How about (please check):
if (idx * sizeof(u64) + sizeof(u64) > info->event_group->mmio_size)
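One way to express the intended bound, assuming that an 8-byte read at byte offset idx * sizeof(u64) must lie entirely inside the region, is sketched below as plain user-space C. counter_in_range() is a hypothetical name used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when an 8-byte read of counter @idx stays inside a region of
 * @mmio_size bytes. Comparing counter counts instead of byte offsets
 * avoids any overflow in idx * sizeof(uint64_t).
 */
static bool counter_in_range(size_t idx, size_t mmio_size)
{
	return idx < mmio_size / sizeof(uint64_t);
}
```

Note that this still admits the three trailing status registers. A stricter check could bound idx by the number of counters (number of RMIDs times number of events) instead, if that is what is wanted.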
> + pr_warn_once("MMIO index %d out of range\n", idx);
> + return -EINVAL;
The function's return percolates up to rdtgroup_mondata_show() where
the return code is translated into text: -EINVAL becomes "Unavailable"
and -EIO becomes "Error". Seems like this should be -EIO instead?
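The translation I am referring to can be summarized with the sketch below. It is an illustration of the mapping only, not a copy of rdtgroup_mondata_show(), and read_error_text() is a hypothetical helper.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Text shown to the user for a failed read, following the mapping
 * described above. Returns NULL for any other code, including success,
 * where the caller would print the counter value instead.
 */
static const char *read_error_text(int err)
{
	switch (err) {
	case -EINVAL:
		return "Unavailable";
	case -EIO:
		return "Error";
	default:
		return NULL;
	}
}
```

An index that falls outside the MMIO region is a programming or enumeration error rather than a counter that is temporarily unavailable, which is why "Error" looks like the better fit to me.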
> + }
> +
> + for (int i = 0; i < mmi->count; i++) {
> + evtcount = readq(mmi->addrs[i] + idx * sizeof(u64));
> + if (!(evtcount & VALID_BIT))
> + return -EINVAL;
What does a set "VALID_BIT" mean? That it is a valid counter or
that the data within it is valid?
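Independent of the answer, the summing loop could accumulate into a local variable so that a partial sum is never exposed when a later counter turns out to be invalid, and so that the result does not depend on the caller having zeroed *val (whether it does is not visible in this hunk). A minimal user-space sketch, with sum_valid_counters() as a hypothetical name and a plain array standing in for the readq() results:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define VALID_BIT (UINT64_C(1) << 63)
#define DATA_BITS (VALID_BIT - 1)	/* bits 62:0 */

/*
 * Sum the data bits of @n raw counter values. Fails with -EINVAL,
 * leaving *sum untouched, if any counter lacks the valid bit.
 */
static int sum_valid_counters(const uint64_t *raw, size_t n, uint64_t *sum)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		if (!(raw[i] & VALID_BIT))
			return -EINVAL;
		total += raw[i] & DATA_BITS;
	}

	*sum = total;
	return 0;
}
```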
> + *val += evtcount & DATA_BITS;
> + }
> +
> + return 0;
> +}
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 8d8ec86929fa..04214585824b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -237,6 +237,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
>
> resctrl_arch_rmid_read_context_check();
>
> + if (r->rid == RDT_RESOURCE_PERF_PKG)
> + return intel_aet_read_event(d->hdr.id, rmid, eventid, val);
> +
Please add a comment or a check that the code that follows is for the L3 resource.
> prmid = logical_rmid_to_physical_rmid(cpu, rmid);
> ret = __rmid_read_phys(prmid, eventid, &msr_val);
> if (ret)
Reinette
* Re: [PATCH v4 23/31] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-04-29 0:33 ` [PATCH v4 23/31] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-05-08 15:58 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:58 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> The L3 resource has several requirements for domains. There are structures
> that hold the 64-bit values of counters, and elements to keep track of
> the overflow and limbo threads.
>
> None of these are needed for the PERF_PKG resource. The hardware counters
> are wide enough that they do not wrap around for decades.
>
> Define a new rdt_perf_pkg_mon_domain structure which just consists of
> the standard rdt_domain_hdr to keep track of domain id and CPU mask.
>
> Change domain_add_cpu_mon(), domain_remove_cpu_mon(),
> resctrl_offline_mon_domain(), and resctrl_online_mon_domain() to check
> resource type and perform only the operations needed for domsins in the
domsins -> domains
> PERF_PKG resource.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 41 ++++++++++++++++++++++++++++++
> fs/resctrl/rdtgroup.c | 4 +++
> 2 files changed, 45 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0103f577e4ca..97fb2001c8d8 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -543,6 +543,38 @@ static void setup_l3_mon_domain(int cpu, int id, struct rdt_resource *r, struct
> }
> }
>
> +/**
> + * struct rdt_perf_pkg_mon_domain - CPUs sharing an Intel-PMT-scoped resctrl monitor resource
This should not be architecture specific. My first reaction was that this belongs in fs code
but I remember that the advice is that things should only move there when needed to be
shared among architectures. This may thus be ok like this for now.
Reinette
* Re: [PATCH v4 26/31] x86/resctrl: Add energy/perf choices to rdt boot option
2025-04-29 0:33 ` [PATCH v4 26/31] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
@ 2025-05-08 15:58 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:58 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> Users may want to force either of the telemetry features on
> (in the case where they are disabled due to erratum) or off
> (in the case that a limited number of RMIDs for a telemetry
> feature reduces the number of monitor groups that can be
> created.)
>
> Unlike other options that are tied to X86_FEATURE_* flags,
> these must be queried by name. Add a function to do that.
>
> Add checks for users who forced either feature off.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> .../admin-guide/kernel-parameters.txt | 2 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 19 +++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 6 ++++++
> 4 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index d9fd26b95b34..4811bc812f0f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5988,7 +5988,7 @@
> rdt= [HW,X86,RDT]
> Turn on/off individual RDT features. List is:
> cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
> - mba, smba, bmec.
> + mba, smba, bmec, energy, perf.
> E.g. to turn on cmt and turn off mba use:
> rdt=cmt,!mba
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index dd5fe8a98304..92cbba9d82a8 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -167,6 +167,8 @@ void __init intel_rdt_mbm_apply_quirk(void);
>
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>
> +bool rdt_check_option(char *name, bool is_on, bool is_off);
> +
> #ifdef CONFIG_INTEL_AET_RESCTRL
> bool intel_aet_get_events(void);
> void __exit intel_aet_exit(void);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 9fa4cc66faf4..dc312e24ab87 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -795,6 +795,8 @@ enum {
> RDT_FLAG_MBA,
> RDT_FLAG_SMBA,
> RDT_FLAG_BMEC,
> + RDT_FLAG_ENERGY,
> + RDT_FLAG_PERF,
> };
>
> #define RDT_OPT(idx, n, f) \
> @@ -820,6 +822,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
> RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
> RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
> RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
> + RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
> + RDT_OPT(RDT_FLAG_PERF, "perf", 0),
> };
> #define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
>
> @@ -869,6 +873,21 @@ bool rdt_cpu_has(int flag)
> return ret;
> }
>
> +/* Check if a named option has been forced on, or forced off */
> +bool rdt_check_option(char *name, bool is_on, bool is_off)
Please make it obvious what this function does. What do the "is_on"/"is_off"
parameters represent?
What does it mean when this function returns "true" vs "false"?
Also, please reconsider the name of this function to make it obvious
to the reader how to interpret the return value.
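One possible shape, offered only as a sketch: split the query into two helpers whose names state the question being asked, so that call sites need no boolean arguments. Everything below is a hypothetical user-space simplification (struct option, the options[] table and both helper names are made up for illustration), not a proposal for the exact kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct rdt_options. */
struct option {
	const char *name;
	bool force_on;
	bool force_off;
};

static struct option options[] = {
	{ "energy", false, true },	/* as if booted with rdt=!energy */
	{ "perf",   true,  false },	/* as if booted with rdt=perf */
};

static const struct option *find_option(const char *name)
{
	for (size_t i = 0; i < sizeof(options) / sizeof(options[0]); i++)
		if (!strcmp(name, options[i].name))
			return &options[i];
	return NULL;
}

/* True only if @name is a known option that the user forced off. */
static bool rdt_option_forced_off(const char *name)
{
	const struct option *o = find_option(name);

	return o && o->force_off;
}

/* True only if @name is a known option that the user forced on. */
static bool rdt_option_forced_on(const char *name)
{
	const struct option *o = find_option(name);

	return o && o->force_on;
}
```

A call site would then read "if (rdt_option_forced_off(name))", which documents itself.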
> +{
> + struct rdt_options *o;
> +
> + WARN_ON(!(is_on ^ is_off));
> +
> + for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
> + if (!strcmp(name, o->name))
> + return (is_on && o->force_on) || (is_off && o->force_off);
> + }
> +
> + return false;
> +}
> +
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
> {
> if (!rdt_cpu_has(X86_FEATURE_BMEC))
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 0bbf991da981..aacaedcc7b74 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -49,6 +49,7 @@ struct pmt_event {
> * gleaned from the XML files. Others are set from data
> * retrieved from intel_pmt_get_regions_by_feature().
> * @pfg: The pmt_feature_group for this event group
> + * @name: Name for this group
> * @guid: Unique number per XML description file
> * @mmio_size: Number of bytes of mmio registers for this group
> * @pkginfo: Per-package MMIO addresses
> @@ -57,6 +58,7 @@ struct pmt_event {
> */
> struct event_group {
> struct pmt_feature_group *pfg;
> + char *name;
> int guid;
> int mmio_size;
> struct mmio_info **pkginfo;
> @@ -66,6 +68,7 @@ struct event_group {
>
> /* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-ENERGY *.xml */
> static struct event_group energy_0x26696143 = {
> + .name = "energy",
> .guid = 0x26696143,
> .mmio_size = (576 * 2 + 3) * 8,
> .num_events = 2,
> @@ -77,6 +80,7 @@ static struct event_group energy_0x26696143 = {
>
> /* Link: https://github.com/intel/Intel-PMT xml/CWF/OOBMSM/RMID-PERF *.xml */
> static struct event_group perf_0x26557651 = {
> + .name = "perf",
> .guid = 0x26557651,
> .mmio_size = (576 * 7 + 3) * 8,
> .num_events = 7,
> @@ -208,6 +212,8 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
> for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
> for (int i = 0; i < p->count; i++) {
> if ((*peg)->guid == p->regions[i].guid) {
> + if (rdt_check_option((*peg)->name, false, true))
What does the "false" and "true" represent here?
> + return false;
> ret = configure_events((*peg), p);
> if (ret) {
> (*peg)->pfg = no_free_ptr(p);
Reinette
* Re: [PATCH v4 27/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources
2025-04-29 0:33 ` [PATCH v4 27/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
@ 2025-05-08 15:59 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 15:59 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 4/28/25 5:33 PM, Tony Luck wrote:
> There are now three meanings for "number of RMIDs":
>
> 1) The number for legacy features enumerated by CPUID leaf 0xF. This
> is the maximum number of distinct values that can be loaded into the
> IA32_PQR_ASSOC MSR. Note that systems with Sub-NUMA Cluster mode enabled
> will force scaling down the CPUID enumerated value by the number of SNC
> nodes per L3-cache.
>
> 2) The number of registers in MMIO space for each event. This
> is enumerated in the XML files and is the value placed into
> event_group::num_rmids.
This is unexpected and not true at this point. Instead this is something
this patch introduces ... kind of, since the value is obtained from the XML
file and then "adjusted".
>
> 3) The number of "h/w counters" (this isn't a strictly accurate
> description of how things work, but serves as a useful analogy that
> does describe the limitations) feeding to those MMIO registers. This
> is enumerated in telemetry_region::num_rmids returned from the call to
> intel_pmt_get_regions_by_feature()
>
> Event groups with insufficient "h/w counters" to track all RMIDs are
> difficult for users to use, since the system may reassign "h/w counters"
> at any time. This means that users cannot reliably collect two consecutive
> event counts to compute the rate at which events are occurring.
>
> Ignore such under-resourced event groups unless the user explicitly
> requests to enable them using the "rdt=" Linux boot argument.
>
> Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG
> resource "num_rmids" value to the smallest of these values to ensure
> that all resctrl groups have equal monitor capabilities.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 25 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 ++
> 3 files changed, 29 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 92cbba9d82a8..31499bcd2065 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -18,6 +18,8 @@
>
> #define RMID_VAL_UNAVAIL BIT_ULL(62)
>
> +extern int rdt_num_system_rmids;
> +
> /*
> * With the above fields in use 62 bits remain in MSR_IA32_QM_CTR for
> * data to be returned. The counter width is discovered from the hardware
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index aacaedcc7b74..eec5eb625f13 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -14,6 +14,7 @@
> #include <linux/cleanup.h>
> #include <linux/cpu.h>
> #include <linux/io.h>
> +#include <linux/minmax.h>
> #include <linux/resctrl.h>
>
> /* Temporary - delete from final version */
> @@ -51,6 +52,7 @@ struct pmt_event {
> * @pfg: The pmt_feature_group for this event group
> * @name: Name for this group
> * @guid: Unique number per XML description file
> + * @num_rmids: Number of RMIDS supported by this group
Can append (from XML file, then adjusted ... ?)
> * @mmio_size: Number of bytes of mmio registers for this group
> * @pkginfo: Per-package MMIO addresses
> * @num_events: Number of events in this group
> @@ -60,6 +62,7 @@ struct event_group {
> struct pmt_feature_group *pfg;
> char *name;
> int guid;
> + int num_rmids;
> int mmio_size;
> struct mmio_info **pkginfo;
> int num_events;
> @@ -70,6 +73,7 @@ struct event_group {
> static struct event_group energy_0x26696143 = {
> .name = "energy",
> .guid = 0x26696143,
> + .num_rmids = 576,
> .mmio_size = (576 * 2 + 3) * 8,
> .num_events = 2,
> .evts = {
> @@ -82,6 +86,7 @@ static struct event_group energy_0x26696143 = {
> static struct event_group perf_0x26557651 = {
> .name = "perf",
> .guid = 0x26557651,
> + .num_rmids = 576,
> .mmio_size = (576 * 7 + 3) * 8,
> .num_events = 7,
> .evts = {
> @@ -214,6 +219,15 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
> if ((*peg)->guid == p->regions[i].guid) {
> if (rdt_check_option((*peg)->name, false, true))
> return false;
> + /*
> + * Ignore event group with insufficient RMIDs unless the
> + * user used the rdt= boot option to specifically ask
> + * for it to be enabled.
> + */
> + if (p->regions[i].num_rmids < rdt_num_system_rmids &&
> + !rdt_check_option((*peg)->name, true, false))
> + return false;
> + (*peg)->num_rmids = p->regions[i].num_rmids;
Does this need a min()? Since this cycles through multiple regions it seems possible that
if the regions support different numbers of RMIDs then the event group's adjustment to
deal with a region with few RMIDs will be undone by a following region with more RMIDs.
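
To make the concern concrete, here is a minimal userspace sketch (not the patch's code) that keeps the smallest value seen across all regions matching a group's guid, so a later region with more RMIDs cannot undo an earlier, smaller limit. struct region_sketch and group_num_rmids() are hypothetical names, and the region values in the usage note are invented.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for one telemetry region. */
struct region_sketch {
	unsigned int guid;
	int num_rmids;
};

/*
 * Return the smallest num_rmids among the regions whose guid matches,
 * or 0 if no region matches. Taking the minimum means the result does
 * not depend on the order in which regions are scanned.
 */
int group_num_rmids(const struct region_sketch *regions, int count,
		    unsigned int guid)
{
	int lowest = INT_MAX;
	bool found = false;
	int i;

	for (i = 0; i < count; i++) {
		if (regions[i].guid != guid)
			continue;
		if (regions[i].num_rmids < lowest)
			lowest = regions[i].num_rmids;
		found = true;
	}

	return found ? lowest : 0;
}
```

For example, with two regions for guid 0x26696143 reporting 64 and 576 RMIDs (in either order), the function returns 64, whereas a plain assignment inside the loop would end up with whichever region happened to be scanned last.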
> ret = configure_events((*peg), p);
> if (ret) {
> (*peg)->pfg = no_free_ptr(p);
Reinette
* Re: [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable
2025-05-08 3:28 ` Reinette Chatre
@ 2025-05-08 18:32 ` Luck, Tony
2025-05-08 23:44 ` Reinette Chatre
0 siblings, 1 reply; 72+ messages in thread
From: Luck, Tony @ 2025-05-08 18:32 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Wed, May 07, 2025 at 08:28:56PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 4/28/25 5:33 PM, Tony Luck wrote:
> > The fs/arch boundary is a little muddy for adding new monitor features.
>
> It is not possible to accurately interpret what is meant with "little muddy".
> Please add specific information that can be verified/reasoned about.
I'll work on something more descriptive/useful.
> >
> > Clean it up by making the mon_evt structure the source of all information
> > about each event. In this case replace the bitmap of enabled monitor
> > features with an "enabled" bit in the mon_evt structure.
>
> bit -> boolean?
Will fix ("bit" was left over from earlier implementation).
> >
> > Change architecture code to inform file system code which events are
> > available on a system with resctrl_enable_mon_event().
>
> (nit: no need to mention that a patch changes code, it should be implied.)
>
> This could be, "An architecture uses resctrl_enable_mon_event() to inform
> resctrl fs which events are enabled on the system."
Will update with this.
> (I think we need to be cautious about the "available" vs "enabled"
> distinction.)
Maybe a comment above mon_event_all[]?
/*
* All available events. Architecture code marks the ones that
* are supported by a system using resctrl_enable_mon_event()
* to set .enabled.
*/
struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
> >
> > Replace the event and architecture specific:
> > resctrl_arch_is_llc_occupancy_enabled()
> > resctrl_arch_is_mbm_total_enabled()
> > resctrl_arch_is_mbm_local_enabled()
> > functions with calls to resctrl_is_mon_event_enabled() with the
> > appropriate QOS_L3_* enum resctrl_event_id.
>
> No mention or motivation for the new array. I think the new array is an
> improvement and now it begs the question whether rdt_resource::evt_list is
> still needed? It seems to me that any usage of rdt_resource::evt_list can
> use the new mon_event_all[] instead?
Good suggestion. rdt_resource::evt_list can indeed be dropped. A
standalone patch to do so reduces lines of code:
include/linux/resctrl.h | 2 --
fs/resctrl/internal.h | 2 --
fs/resctrl/monitor.c | 18 +-----------------
fs/resctrl/rdtgroup.c | 11 ++++++-----
4 files changed, 7 insertions(+), 26 deletions(-)
But I'll merge into one of the early patches to avoid adding new code to create
the evt_list and then delete it again.
> With struct mon_evt being independent like before this
> patch it almost seems as though it prepared for multiple resources to
> support the same event (do you know history here?). This appears to already
> be thwarted by rdt_mon_features though ... although theoretically it could
> have been "rdt_l3_mon_features".
> Even so, with patch #4 adding the resource ID all event information is
> centralized. Only potential issue may be if multiple resources use the
> same event ... but since the existing event IDs already have resource
> name embedded this does not seem to be of concern?
The existing evt_list approach would corrupt the lists if the same event
were added to multiple resources. Without the list this becomes
possible, but seems neither desirable, nor useful.
I will add a warning to resctrl_enable_mon_event() if architecture
code tries to enable an already enabled event.
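
A rough userspace sketch of that check, assuming an event array indexed by event ID. The enum, the sketch_* function names and the return convention are illustrative only; kernel code would presumably use a WARN-style macro instead of fprintf().

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative event IDs; not the kernel's enum resctrl_event_id. */
enum sketch_event_id {
	SKETCH_LLC_OCCUPANCY,
	SKETCH_MBM_TOTAL,
	SKETCH_MBM_LOCAL,
	SKETCH_NUM_EVENTS
};

struct sketch_mon_evt {
	const char *name;
	bool enabled;
};

/* All known events. Architecture code marks the ones it supports. */
static struct sketch_mon_evt sketch_event_all[SKETCH_NUM_EVENTS] = {
	[SKETCH_LLC_OCCUPANCY] = { .name = "llc_occupancy" },
	[SKETCH_MBM_TOTAL]     = { .name = "mbm_total_bytes" },
	[SKETCH_MBM_LOCAL]     = { .name = "mbm_local_bytes" },
};

/*
 * Enable one event. Returns true if the event was newly enabled and
 * false if the ID is out of range or the event was already enabled.
 */
bool sketch_enable_mon_event(unsigned int evt)
{
	if (evt >= SKETCH_NUM_EVENTS)
		return false;

	if (sketch_event_all[evt].enabled) {
		fprintf(stderr, "event %s already enabled\n",
			sketch_event_all[evt].name);
		return false;
	}

	sketch_event_all[evt].enabled = true;
	return true;
}

/* Query helper: out-of-range IDs are reported as not enabled. */
bool sketch_is_mon_event_enabled(unsigned int evt)
{
	return evt < SKETCH_NUM_EVENTS && sketch_event_all[evt].enabled;
}
```

Because the array is indexed by event ID, each event has exactly one "enabled" flag, which is why a second enable request for the same ID can be detected cheaply.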
>
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
>
> ...
>
> > @@ -866,14 +879,13 @@ static struct mon_evt mbm_local_event = {
> > */
> > static void l3_mon_evt_init(struct rdt_resource *r)
> > {
> > + enum resctrl_event_id evt;
> > +
> > INIT_LIST_HEAD(&r->evt_list);
> >
> > - if (resctrl_arch_is_llc_occupancy_enabled())
> > - list_add_tail(&llc_occupancy_event.list, &r->evt_list);
> > - if (resctrl_arch_is_mbm_total_enabled())
> > - list_add_tail(&mbm_total_event.list, &r->evt_list);
> > - if (resctrl_arch_is_mbm_local_enabled())
> > - list_add_tail(&mbm_local_event.list, &r->evt_list);
> > + for (evt = 0; evt < QOS_NUM_EVENTS; evt++)
> > + if (mon_event_all[evt].enabled)
> > + list_add_tail(&mon_event_all[evt].list, &r->evt_list);
> > }
>
> This hunk can create confusion with it adding "all enabled events" to
> a single resource. I understand that at this point only L3 supports monitoring
> and this works ok, but in the context of this work it creates a caveat early
> in series that needs to be fixed later (patch #4). This wrangling becomes
> unnecessary if removing rdt_resource::evt_list.
I'll see if I can get a clean sequence between these patches to avoid
this confusion. Maybe evt_list removal needs to happen here.
>
> Reinette
-Tony
* Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-05-08 15:49 ` Reinette Chatre
@ 2025-05-08 20:28 ` Luck, Tony
2025-05-08 23:45 ` Reinette Chatre
0 siblings, 1 reply; 72+ messages in thread
From: Luck, Tony @ 2025-05-08 20:28 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> shortlog nit: "fs/resctrl: Support additional monitor event display formats"
>
> On 4/28/25 5:33 PM, Tony Luck wrote:
> > Resctrl was written with the assumption that all monitor events
> > can be displayed as unsigned decimal integers.
> >
> > Some telemetry events provide greater precision where architecture code
> > uses a fixed point format with 18 binary places.
> >
> > Add a "display_format" field to struct mon_evt which can specify
> > that the value for the event be displayed as an integer for legacy
> > events, or as a floating point value with six decimal places converted
> > from the fixed point format received from architecture code.
>
> There was no discussion on this during the previous version.
> While this version addresses the issue of architecture changing the
> format it does not address the issue of how to handle different
> architecture formats. With this change any architecture that may
> want to support any of these events will be required to translate
> whatever format it uses into the one Intel uses to be translated
> again into format for user space. Do you think this is reasonable?
>
> Alternatively, resctrl could add additional file that contains the
> format so that if an architecture in the future needs to present data
> differently, an interface will exist to guide userspace how to parse it.
> Creation of such user interface cannot be delayed until the time
> it is needed since then these formats would be ABI.
What if resctrl filesystem allows architecture to supply the number
of binary places for fixed point values when enabling an event?
That would allow h/w implementations to pick an appropriate precision
for each new event. Different implementations of the same event
(e.g. "core_energy") may pick different precision across architectures
or between generations of the same architecture.
File system code can then do:
if (binary_places == 0)
display as integer
else
convert to floating point (with one decimal place per
three binary places)
Existing events are all integers and won't change (it would be weird
for an architecture to report "mbm_local_bytes" with a fixed point
rather than integer value).
New events may report in either integer or floating point format
with varying amounts of precision. But I'm not sure that would be
a burden for writing tools that can run on different architectures.
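
The pseudocode above could be fleshed out roughly as follows. This is a userspace sketch using integer arithmetic only, not the code from the patch: format_event_value() is a hypothetical name, and the sketch assumes binary_places is either 0 or in the range 3..30 so that the intermediate product fits in 64 bits.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a fixed-point value with "binary_places" fractional bits.
 * binary_places == 0 prints a plain integer; otherwise one decimal
 * place is printed per three binary places (18 bits gives 6 digits).
 * Returns the snprintf() result, or -1 for an unsupported precision.
 */
int format_event_value(char *buf, size_t len, uint64_t val,
		       unsigned int binary_places)
{
	unsigned int decimal_places, i;
	uint64_t scale = 1, int_part, frac, frac_dec;

	if (binary_places == 0)
		return snprintf(buf, len, "%" PRIu64, val);

	/* Keep frac * scale within 64 bits (sketch assumption). */
	if (binary_places < 3 || binary_places > 30)
		return -1;

	decimal_places = binary_places / 3;
	for (i = 0; i < decimal_places; i++)
		scale *= 10;

	int_part = val >> binary_places;
	frac = val & ((1ULL << binary_places) - 1);

	/* Convert the binary fraction to decimal, rounding to nearest. */
	frac_dec = (frac * scale + (1ULL << (binary_places - 1))) >> binary_places;

	/* Defensive: carry into the integer part if rounding overflowed. */
	if (frac_dec >= scale) {
		int_part++;
		frac_dec -= scale;
	}

	return snprintf(buf, len, "%" PRIu64 ".%0*" PRIu64,
			int_part, (int)decimal_places, frac_dec);
}
```

With 18 binary places the raw value (1 << 18) + (1 << 17) formats as "1.500000", and a raw value of 42 with 0 binary places formats as "42".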
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > include/linux/resctrl_types.h | 5 +++++
> > fs/resctrl/internal.h | 2 ++
> > fs/resctrl/ctrlmondata.c | 24 +++++++++++++++++++++++-
> > fs/resctrl/monitor.c | 21 ++++++++++++---------
> > 4 files changed, 42 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> > index 5ef14a24008c..6245034f6c76 100644
> > --- a/include/linux/resctrl_types.h
> > +++ b/include/linux/resctrl_types.h
>
> This needs to be internal to resctrl fs.
> resctrl_types.h should only contain the types required in asm/resctrl.h
>
> > @@ -50,4 +50,9 @@ enum resctrl_event_id {
> > #define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> > #define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
> >
> > +/* Event value display formats */
>
> Please add details about what each format means (how it should
> be interpreted).
>
> Reinette
-Tony
* Re: [PATCH v4 01/31] x86,fs/resctrl: Drop rdt_mon_features variable
2025-05-08 18:32 ` Luck, Tony
@ 2025-05-08 23:44 ` Reinette Chatre
0 siblings, 0 replies; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 23:44 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 5/8/25 11:32 AM, Luck, Tony wrote:
> On Wed, May 07, 2025 at 08:28:56PM -0700, Reinette Chatre wrote:
>> On 4/28/25 5:33 PM, Tony Luck wrote:
...
>>> Change architecture code to inform file system code which events are
>>> available on a system with resctrl_enable_mon_event().
>>
>> (nit: no need to mention that a patch changes code, it should be implied.)
>>
>> This could be, "An architecture uses resctrl_enable_mon_event() to inform
>> resctrl fs which events are enabled on the system."
>
> Will update with this.
>
>> (I think we need to be cautious about the "available" vs "enabled"
>> distinction.)
>
> Maybe a comment above mon_event_all[]?
Good idea.
>
> /*
> * All available events. Architecture code marks the ones that
I think "available" may be interpreted differently by people.
How about "All known events."?
> * are supported by a system using resctrl_enable_mon_event()
> * to set .enabled.
> */
> struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
>
>>>
>>> Replace the event and architecture specific:
>>> resctrl_arch_is_llc_occupancy_enabled()
>>> resctrl_arch_is_mbm_total_enabled()
>>> resctrl_arch_is_mbm_local_enabled()
>>> functions with calls to resctrl_is_mon_event_enabled() with the
>>> appropriate QOS_L3_* enum resctrl_event_id.
>>
>> No mention or motivation for the new array. I think the new array is an
>> improvement and now it begs the question whether rdt_resource::evt_list is
>> still needed? It seems to me that any usage of rdt_resource::evt_list can
>> use the new mon_event_all[] instead?
>
> Good suggestion. rdt_resource::evt_list can indeed be dropped. A
> standalone patch to do so reduces lines of code:
>
> include/linux/resctrl.h | 2 --
> fs/resctrl/internal.h | 2 --
> fs/resctrl/monitor.c | 18 +-----------------
> fs/resctrl/rdtgroup.c | 11 ++++++-----
> 4 files changed, 7 insertions(+), 26 deletions(-)
>
> But I'll merge into one of the early patches to avoid adding new code to create
> the evt_list and then delete it again.
Thanks for considering.
>
>> With struct mon_evt being independent like before this
>> patch it almost seems as though it prepared for multiple resources to
>> support the same event (do you know history here?). This appears to already
>> be thwarted by rdt_mon_features though ... although theoretically it could
>> have been "rdt_l3_mon_features".
>> Even so, with patch #4 adding the resource ID all event information is
>> centralized. Only potential issue may be if multiple resources use the
>> same event ... but since the existing event IDs already have resource
>> name embedded this does not seem to be of concern?
>
> The existing evt_list approach would corrupt the lists if the same event
> were added to multiple resources. Without the list this becomes
> possible, but seems neither desirable, nor useful.
ack. With an event array indexed by event ID it would also take some additional
changes to support.
>
> I will add a warning to resctrl_enable_mon_event() if architecture
> code tries to enable an already enabled event.
Thank you very much.
>>
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>
>> ...
>>
>>> @@ -866,14 +879,13 @@ static struct mon_evt mbm_local_event = {
>>> */
>>> static void l3_mon_evt_init(struct rdt_resource *r)
>>> {
>>> + enum resctrl_event_id evt;
>>> +
>>> INIT_LIST_HEAD(&r->evt_list);
>>>
>>> - if (resctrl_arch_is_llc_occupancy_enabled())
>>> - list_add_tail(&llc_occupancy_event.list, &r->evt_list);
>>> - if (resctrl_arch_is_mbm_total_enabled())
>>> - list_add_tail(&mbm_total_event.list, &r->evt_list);
>>> - if (resctrl_arch_is_mbm_local_enabled())
>>> - list_add_tail(&mbm_local_event.list, &r->evt_list);
>>> + for (evt = 0; evt < QOS_NUM_EVENTS; evt++)
>>> + if (mon_event_all[evt].enabled)
>>> + list_add_tail(&mon_event_all[evt].list, &r->evt_list);
>>> }
>>
>> This hunk can create confusion with it adding "all enabled events" to
>> a single resource. I understand that at this point only L3 supports monitoring
>> and this works ok, but in the context of this work it creates a caveat early
>> in series that needs to be fixed later (patch #4). This wrangling becomes
>> unnecessary if removing rdt_resource::evt_list.
>
> I'll see if I can get a clean sequence between these patches to avoid
> this confusion. Maybe evt_list removal needs to happen here.
Thank you.
Reinette
* Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-05-08 20:28 ` Luck, Tony
@ 2025-05-08 23:45 ` Reinette Chatre
2025-05-09 11:29 ` Dave Martin
0 siblings, 1 reply; 72+ messages in thread
From: Reinette Chatre @ 2025-05-08 23:45 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 5/8/25 1:28 PM, Luck, Tony wrote:
> On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
>> On 4/28/25 5:33 PM, Tony Luck wrote:
>>> Resctrl was written with the assumption that all monitor events
>>> can be displayed as unsigned decimal integers.
>>>
>>> Some telemetry events provide greater precision where architecture code
>>> uses a fixed point format with 18 binary places.
>>>
>>> Add a "display_format" field to struct mon_evt which can specify
>>> that the value for the event be displayed as an integer for legacy
>>> events, or as a floating point value with six decimal places converted
>>> from the fixed point format received from architecture code.
>>
>> There was no discussion on this during the previous version.
>> While this version addresses the issue of architecture changing the
>> format it does not address the issue of how to handle different
>> architecture formats. With this change any architecture that may
>> want to support any of these events will be required to translate
>> whatever format it uses into the one Intel uses to be translated
>> again into format for user space. Do you think this is reasonable?
>>
>> Alternatively, resctrl could add additional file that contains the
>> format so that if an architecture in the future needs to present data
>> differently, an interface will exist to guide userspace how to parse it.
>> Creation of such user interface cannot be delayed until the time
>> it is needed since then these formats would be ABI.
>
> What if resctrl filesystem allows architecture to supply the number
> of binary places for fixed point values when enabling an event?
This sounds good. I do not think we are in a position to come up with
an ideal solution. That would require assumptions of what another
architecture may or may not do and thus we do not have complete information.
>
> That would allow h/w implementations to pick an appropriate precision
> for each new event. Different implementations of the same event
> (e.g. "core_energy") may pick different precision across architectures
> or between generations of the same architecture.
>
> File system code can then do:
>
> if (binary_places == 0)
> display as integer
> else
> convert to floating point (with one decimal place per
> three binary places)
I do not think this problem needs to be solved in this work but there needs
to be a plan for how other architectures can be supported. When similar
enabling needs to be done for that hypothetical architecture then it can
be implemented ... if it is still valid based on what that architecture actually
supports.
It may be sufficient for the "plan" (as above) to be in comments.
>
> Existing events are all integers and won't change (it would be weird
> for an architecture to report "mbm_local_bytes" with a fixed point
> rather than integer value).
>
> New events may report in either integer or floating point format
> with varying amounts of precision. But I'm not sure that would be
Partly this will depend on the unit of measurement that should form part of
the definition of the event. For example, events reporting cycles or ticks
should only be integer, no?
> a burden for writing tools that can run on different architectures.
Maybe just a comment in the docs then ... and now I see that you did
so already. My apologies, I did not get to the last four patches.
Reinette
* Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-05-08 23:45 ` Reinette Chatre
@ 2025-05-09 11:29 ` Dave Martin
2025-05-09 14:46 ` Peter Newman
0 siblings, 1 reply; 72+ messages in thread
From: Dave Martin @ 2025-05-09 11:29 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi,
(Backtrace retained for context -- see my comment at the end.)
Cheers
---Dave
[...]
On Thu, May 08, 2025 at 04:45:21PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/8/25 1:28 PM, Luck, Tony wrote:
> > On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> >> On 4/28/25 5:33 PM, Tony Luck wrote:
> >>> Resctrl was written with the assumption that all monitor events
> >>> can be displayed as unsigned decimal integers.
> >>>
> >>> Some telemetry events provide greater precision where architecture code
> >>> uses a fixed point format with 18 binary places.
> >>>
> >>> Add a "display_format" field to struct mon_evt which can specify
> >>> that the value for the event be displayed as an integer for legacy
> >>> events, or as a floating point value with six decimal places converted
> >>> from the fixed point format received from architecture code.
> >>
> >> There was no discussion on this during the previous version.
> >> While this version addresses the issue of architecture changing the
> >> format it does not address the issue of how to handle different
> >> architecture formats. With this change any architecture that may
> >> want to support any of these events will be required to translate
> >> whatever format it uses into the one Intel uses to be translated
> >> again into format for user space. Do you think this is reasonable?
> >>
> >> Alternatively, resctrl could add additional file that contains the
> >> format so that if an architecture in the future needs to present data
> >> differently, an interface will exist to guide userspace how to parse it.
> >> Creation of such user interface cannot be delayed until the time
> >> it is needed since then these formats would be ABI.
> >
> > What if resctrl filesystem allows architecture to supply the number
> > of binary places for fixed point values when enabling an event?
>
> This sounds good. I do not think we are in a position to come up with
> an ideal solution. That would require assumptions of what another
> architecture may or may not do and thus we do not have complete information.
>
> >
> > That would allow h/w implementations to pick an appropriate precision
> > for each new event. Different implementations of the same event
> > (e.g. "core_energy") may pick different precision across architectures
> > or between generations of the same architecture.
> >
> > File system code can then do:
> >
> > if (binary_places == 0)
> > display as integer
> > else
> > convert to floating point (with one decimal place per
> > three binary places)
>
> I do not think this problem needs to be solved in this work but there needs
> to be a plan for how other architectures can be supported. When similar
> enabling needs to be done for that hypothetical architecture then it can
> be implemented ... if it is still valid based on what that architecture actually
> supports.
> It may be sufficient for the "plan" (as above) to be in comments.
>
> >
> > Existing events are all integers and won't change (it would be weird
> > for an architecture to report "mbm_local_bytes" with a fixed point
> > rather than integer value).
> >
> > New events may report in either integer or floating point format
> > with varying amounts of precision. But I'm not sure that would be
>
> Partly this will depend on the unit of measurement that should form part of
> the definition of the event. For example, events reporting cycles or ticks
> should only be integer, no?
>
> > a burden for writing tools that can run on different architectures.
>
> Maybe just a comment in the docs then ... and now I see that you did
> so already. My apologies, I did not get to the last four patches.
>
> Reinette
Just a thought, but I think that while it's not possible to be fully
generic, a parameter model along the lines of
quantity = raw_value * ((double)multiplier / divisor) * BASE_UNIT
would cover most things that we have or can reasonably foresee,
including memory bandwidth control values.
raw_value, multiplier and divisor would all be integers.
Since raw_value can be the value used by the hardware, its precision
can probably be fixed at 1, though we could still report it explicitly.
Fundamental base units would be things like "byte", "bytes per second"
and "none" (i.e., dimensionless quantities). (Are there others?)
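
As a sketch of how userspace might apply such a model (nothing here is an existing resctrl interface): the struct, enum and function names below are hypothetical, and the scaled result is returned as a double in the event's base unit.

```c
#include <stdint.h>

/* Hypothetical base units, following the examples given above. */
enum sketch_base_unit {
	SKETCH_UNIT_NONE,		/* dimensionless */
	SKETCH_UNIT_BYTES,
	SKETCH_UNIT_BYTES_PER_SEC
};

/* Hypothetical per-event scaling parameters reported by the kernel. */
struct sketch_event_scale {
	uint64_t multiplier;
	uint64_t divisor;
	enum sketch_base_unit unit;
};

/*
 * quantity = raw_value * (multiplier / divisor), in units of s->unit.
 * Returns 0.0 for a zero divisor rather than dividing by zero.
 */
double sketch_scale_raw_value(uint64_t raw_value,
			      const struct sketch_event_scale *s)
{
	if (s->divisor == 0)
		return 0.0;

	return (double)raw_value *
	       ((double)s->multiplier / (double)s->divisor);
}

/* Human-readable unit name for display purposes. */
const char *sketch_unit_name(enum sketch_base_unit unit)
{
	switch (unit) {
	case SKETCH_UNIT_BYTES:
		return "byte";
	case SKETCH_UNIT_BYTES_PER_SEC:
		return "bytes per second";
	default:
		return "none";
	}
}
```

Under this model a fixed-point value with 18 binary places is just multiplier = 1, divisor = 1 << 18, and a counter that ticks in 64-byte units is multiplier = 64, divisor = 1.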
Since we cannot guess for certain what userspace wants to do with the
values, it feels better to let userspace do any scaling calculations
itself, rather than trying to prettify the interface.
For example: scaling memory bandwidth percentages for MPAM is a
nuisance because the hardware uses fixed-point values scaled by a power
of 2, not by 100: the two scales can never match up anywhere except at
multiples of 25%, leading to irregular increments when rounded to an
integer percentage value and uncertainty about what the bandwidth_gran
parameter means. Round-trip conversions between the two
representations become error-prone due to repeated rounding -- this
proved quite fiddly to get right. Precision beyond 1% increments may
also be available in the hardware, but is not accessible through the
resctrl interface.
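
The rounding problem can be shown with a small, self-contained sketch. It is a simplified model, not MPAM driver code: the hardware value is treated as a fraction with frac_bits fractional bits, the function names are hypothetical, and the sketch assumes 1 <= frac_bits <= 16 and pct <= 100.

```c
#include <stdbool.h>
#include <stdint.h>

/* Percentage to hardware fraction (pct / 100 * 2^frac_bits), rounded. */
uint32_t sketch_pct_to_hw(uint32_t pct, unsigned int frac_bits)
{
	return (pct * (1u << frac_bits) + 50) / 100;
}

/* Hardware fraction back to an integer percentage, rounded. */
uint32_t sketch_hw_to_pct(uint32_t hw, unsigned int frac_bits)
{
	return (hw * 100 + (1u << (frac_bits - 1))) >> frac_bits;
}

/*
 * True if pct maps onto the binary scale with no rounding at all.
 * Since 100 = 4 * 25 and 2^frac_bits has no factor of 5, this holds
 * only when pct is a multiple of 25 (given frac_bits >= 2).
 */
bool sketch_pct_is_exact(uint32_t pct, unsigned int frac_bits)
{
	return (pct * (1u << frac_bits)) % 100 == 0;
}
```

With 6 fractional bits, 25% converts to 16/64 and back to 25% exactly, but 1% converts to 1/64 (about 1.56%) and reads back as 2%, so a write followed by a read does not return the value written.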
For backwards compatibility we probably shouldn't change that
particular interface, but if we can avoid new instances of the same
kind of problem then that would be a benefit: i.e., explicitly tell
userspace how to scale a given parameter.
Cheers
---Dave
* Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-05-09 11:29 ` Dave Martin
@ 2025-05-09 14:46 ` Peter Newman
2025-05-09 16:38 ` Luck, Tony
2025-05-09 16:43 ` Dave Martin
0 siblings, 2 replies; 72+ messages in thread
From: Peter Newman @ 2025-05-09 14:46 UTC (permalink / raw)
To: Dave Martin
Cc: Reinette Chatre, Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman,
James Morse, Babu Moger, Drew Fustini, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Dave,
On Fri, May 9, 2025 at 1:29 PM Dave Martin <Dave.Martin@arm.com> wrote:
>
> Hi,
>
> (Backtrace retained for context -- see my comment at the end.)
>
> Cheers
> ---Dave
>
> [...]
>
> On Thu, May 08, 2025 at 04:45:21PM -0700, Reinette Chatre wrote:
> > Hi Tony,
> >
> > On 5/8/25 1:28 PM, Luck, Tony wrote:
> > > On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> > >> On 4/28/25 5:33 PM, Tony Luck wrote:
> > >>> Resctrl was written with the assumption that all monitor events
> > >>> can be displayed as unsigned decimal integers.
> > >>>
> > >>> Some telemetry events provide greater precision where architecture code
> > >>> uses a fixed point format with 18 binary places.
> > >>>
> > >>> Add a "display_format" field to struct mon_evt which can specify
> > >>> that the value for the event be displayed as an integer for legacy
> > >>> events, or as a floating point value with six decimal places converted
> > >>> from the fixed point format received from architecture code.
> > >>
> > >> There was no discussion on this during the previous version.
> > >> While this version addresses the issue of architecture changing the
> > >> format it does not address the issue of how to handle different
> > >> architecture formats. With this change any architecture that may
> > >> want to support any of these events will be required to translate
> > >> whatever format it uses into the one Intel uses to be translated
> > >> again into format for user space. Do you think this is reasonable?
> > >>
> > >> Alternatively, resctrl could add additional file that contains the
> > >> format so that if an architecture in the future needs to present data
> > >> differently, an interface will exist to guide userspace how to parse it.
> > >> Creation of such user interface cannot be delayed until the time
> > >> it is needed since then these formats would be ABI.
> > >
> > > What if resctrl filesystem allows architecture to supply the number
> > > of binary places for fixed point values when enabling an event?
> >
> > This sounds good. I do not think we are in a position to come up with
> > an ideal solution. That would require assumptions of what another
> > architecture may or may not do and thus we do not have complete information.
> >
> > >
> > > That would allow h/w implementations to pick an appropriate precision
> > > for each new event. Different implementations of the same event
> > > (e.g. "core_energy") may pick different precision across architectures
> > > or between generations of the same architecture.
> > >
> > > File system code can then do:
> > >
> > > if (binary_places == 0)
> > > display as integer
> > > else
> > > convert to floating point (with one decimal place per
> > > three binary places)
> >
> > I do not think this problem needs to be solved in this work but there needs
> > to be a plan for how other architectures can be supported. When similar
> > enabling needs to be done for that hypothetical architecture then it can
> > be implemented ... if it is still valid based on what that architecture actually
> > supports.
> > It may be sufficient for the "plan" (as above) to be in comments.
> >
> > >
> > > Existing events are all integers and won't change (it would be weird
> > > for an architecture to report "mbm_local_bytes" with a fixed point
> > > rather than integer value).
> > >
> > > New events may report in either integer or floating point format
> > > with varying amounts of precision. But I'm not sure that would be
> >
> > Partly this will depend on the unit of measurement that should form part of
> > the definition of the event. For example, events reporting cycles or ticks
> > should only be integer, no?
> >
> > > a burden for writing tools that can run on different architectures.
> >
> > Maybe just a comment in the docs then ... and now I see that you did
> > so already. My apologies, I did not get to the last four patches.
> >
> > Reinette
>
> Just a thought, but I think that while it's not possible to be fully
> generic, a parameter model along the lines of
>
> quantity = raw_value * ((double)multiplier / divisor) * BASE_UNIT
>
> would cover most things that we have or can reasonably foresee,
> including memory bandwidth control values.
> raw_value, multiplier and divisor would all be integers.
>
> Since raw_integer can be the value used by the hardware, its precision
> can probably be fixed at 1, though we could still report it explicitly.
>
> Fundamental base units would be things like "byte", "bytes per second"
> and "none" (i.e., dimensionless quantities). (Are there others?)
>
>
> Since we cannot guess for certain what userspace wants to do with the
> values, it feels better to let userspace do any scaling calculations
> itself, rather than trying to prettify the interface.
>
> For example: scaling memory bandwidth percentages for MPAM is a
> nuisance because the hardware uses fixed-point values scaled by a power
> of 2, not by 100: the two scales can never match up anywhere except at
> multiples of 25%, leading to irregular increments when rounded to an
> integer percentage value and uncertainty about what the bandwidth_gran
> parameter means. Round-trip conversions between the two
> representations become error-prone due to repeated rounding -- this
> proved quite fiddly to get right. Precision beyond 1% increments may
> also be available in the hardware, but is not accessible through the
> resctrl interface.
Google users got annoyed with these rounding errors very quickly and
asked me to change the MBA interface to the raw, fixed-point value
used by the MPAM register interface. (but at least shifted down, since
the MBW_MIN/MAX fields are left-justified)
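To make the rounding problem concrete, here is a minimal sketch of the
percentage <-> fixed-point round trip. It assumes a 6-bit fraction and
round-to-nearest in both directions; the helper names and MBW_FRAC_BITS
are made up for illustration and are not taken from the MPAM driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed fraction width, chosen for illustration only. */
#define MBW_FRAC_BITS	6
#define MBW_ONE		(1u << MBW_FRAC_BITS)	/* raw value meaning 100% */

/* Hypothetical helper: integer percentage -> nearest raw fixed-point value. */
static inline uint32_t pct_to_fixed(uint32_t pct)
{
	assert(pct <= 100);
	return (pct * MBW_ONE + 50) / 100;
}

/* Hypothetical helper: raw fixed-point value -> nearest integer percentage. */
static inline uint32_t fixed_to_pct(uint32_t raw)
{
	assert(raw <= MBW_ONE);
	return (raw * 100 + MBW_ONE / 2) / MBW_ONE;
}

/* True when the percentage is exactly representable in the raw format. */
static inline bool pct_is_exact(uint32_t pct)
{
	return (pct * MBW_ONE) % 100 == 0;
}
```

Under these assumptions, writing 10% stores raw value 6, which reads
back as 9%, and consecutive raw values 1, 2, 3 read back as 2%, 3%, 5%.
Only 0, 25, 50, 75 and 100 are exact.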
>
> For backwards compatibility we probably shouldn't change that
> particular interface, but if we can avoid new instances of the same
> kind of problem then that would be a benefit: i.e., explicitly tell
> userspace how to scale a given parameter.
MBA is not programmed by percentage on AMD, so I'm not sure why this
is considered necessary for backwards compatibility.
-Peter
* Re: [PATCH v4 02/31] x86,fs/resctrl: Prepare for more monitor events
2025-04-29 0:33 ` [PATCH v4 02/31] x86,fs/resctrl: Prepare for more monitor events Tony Luck
2025-05-08 3:30 ` Reinette Chatre
@ 2025-05-09 15:02 ` Peter Newman
1 sibling, 0 replies; 72+ messages in thread
From: Peter Newman @ 2025-05-09 15:02 UTC (permalink / raw)
To: Tony Luck
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On Tue, Apr 29, 2025 at 2:34 AM Tony Luck <tony.luck@intel.com> wrote:
>
> There's a rule in computer programming that objects appear zero,
> once, or many times. So code accordingly.
>
> There are two MBM events and resctrl is coded with a lot of
>
> if (local)
> do one thing
> if (total)
> do a different thing
>
> Change the rdt_ctrl_domain and rdt_hw_mon_domain structures to hold
> arrays of pointers to per event data instead of explicit fields for
> total and local bandwidth.
>
> Simplify the code by coding for many events using loops on
> which are enabled.
>
> Move resctrl_is_mbm_event() to <linux/resctrl.h> so it
> can be used more widely. Also provide a for_each_mbm_event()
> helper macro.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 15 +++++---
> include/linux/resctrl_types.h | 3 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 6 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 38 ++++++++++----------
> arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++++++--------
> fs/resctrl/monitor.c | 13 ++++---
> fs/resctrl/rdtgroup.c | 48 ++++++++++++--------------
> 7 files changed, 84 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 3c5d111aae65..cef9b0ed984c 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -161,8 +161,7 @@ struct rdt_ctrl_domain {
> * @hdr: common header for different domain types
> * @ci: cache info for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> - * @mbm_total: saved state for MBM total bandwidth
> - * @mbm_local: saved state for MBM local bandwidth
> + * @mbm_states: saved state for each QOS MBM event
> * @mbm_over: worker to periodically read MBM h/w counters
> * @cqm_limbo: worker to periodically read CQM h/w counters
> * @mbm_work_cpu: worker CPU for MBM h/w counters
> @@ -172,8 +171,7 @@ struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> struct cacheinfo *ci;
> unsigned long *rmid_busy_llc;
> - struct mbm_state *mbm_total;
> - struct mbm_state *mbm_local;
> + struct mbm_state *mbm_states[QOS_NUM_MBM_EVENTS];
> struct delayed_work mbm_over;
> struct delayed_work cqm_limbo;
> int mbm_work_cpu;
> @@ -376,6 +374,15 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid);
> bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
>
> +static inline bool resctrl_is_mbm_event(int e)
> +{
> + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> + e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +#define for_each_mbm_event(evt) \
> + for (evt = QOS_L3_MBM_TOTAL_EVENT_ID; evt <= QOS_L3_MBM_LOCAL_EVENT_ID; evt++)
> +
> /**
> * resctrl_arch_mon_event_config_write() - Write the config for an event.
> * @config_info: struct resctrl_mon_config_info describing the resource, domain
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index a25fb9c4070d..5ef14a24008c 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -47,4 +47,7 @@ enum resctrl_event_id {
> QOS_NUM_EVENTS,
> };
>
> +#define QOS_NUM_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> +#define MBM_EVENT_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
> +
> #endif /* __LINUX_RESCTRL_TYPES_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 5e3c41b36437..02b535c828f3 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -54,15 +54,13 @@ struct rdt_hw_ctrl_domain {
> * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> * a resource for a monitor function
> * @d_resctrl: Properties exposed to the resctrl file system
> - * @arch_mbm_total: arch private state for MBM total bandwidth
> - * @arch_mbm_local: arch private state for MBM local bandwidth
> + * @arch_mbm_states: arch private state for each MBM event
> *
> * Members of this structure are accessed via helpers that provide abstraction.
> */
> struct rdt_hw_mon_domain {
> struct rdt_mon_domain d_resctrl;
> - struct arch_mbm_state *arch_mbm_total;
> - struct arch_mbm_state *arch_mbm_local;
> + struct arch_mbm_state *arch_mbm_states[QOS_NUM_MBM_EVENTS];
> };
>
> static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 819bc7a09327..e5c91d21e8f7 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -364,8 +364,8 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
>
> static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
> {
> - kfree(hw_dom->arch_mbm_total);
> - kfree(hw_dom->arch_mbm_local);
> + for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++)
> + kfree(hw_dom->arch_mbm_states[i]);
> kfree(hw_dom);
> }
>
> @@ -399,25 +399,27 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
> */
> static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
> {
> - size_t tsize;
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> - tsize = sizeof(*hw_dom->arch_mbm_total);
> - hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
> - if (!hw_dom->arch_mbm_total)
> - return -ENOMEM;
> - }
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
> - tsize = sizeof(*hw_dom->arch_mbm_local);
> - hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
> - if (!hw_dom->arch_mbm_local) {
> - kfree(hw_dom->arch_mbm_total);
> - hw_dom->arch_mbm_total = NULL;
> - return -ENOMEM;
> - }
> + size_t tsize = sizeof(struct arch_mbm_state);
> + enum resctrl_event_id evt;
> + int idx;
> +
> + for_each_mbm_event(evt) {
> + if (!resctrl_is_mon_event_enabled(evt))
> + continue;
> + idx = MBM_EVENT_IDX(evt);
> + hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
> + if (!hw_dom->arch_mbm_states[idx])
> + goto cleanup;
> }
>
> return 0;
> +cleanup:
> + while (--idx >= 0) {
> + kfree(hw_dom->arch_mbm_states[idx]);
> + hw_dom->arch_mbm_states[idx] = NULL;
> + }
> +
> + return -ENOMEM;
> }
>
> static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index fda579251dba..bf7fde07846b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,18 +160,21 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
> u32 rmid,
> enum resctrl_event_id eventid)
> {
> + struct arch_mbm_state *state;
> +
> switch (eventid) {
> - case QOS_L3_OCCUP_EVENT_ID:
> - return NULL;
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &hw_dom->arch_mbm_total[rmid];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &hw_dom->arch_mbm_local[rmid];
> default:
> /* Never expect to get here */
> WARN_ON_ONCE(1);
> + fallthrough;
> + case QOS_L3_OCCUP_EVENT_ID:
> return NULL;
> + case QOS_L3_MBM_TOTAL_EVENT_ID:
> + case QOS_L3_MBM_LOCAL_EVENT_ID:
> + state = hw_dom->arch_mbm_states[MBM_EVENT_IDX(eventid)];
> }
> +
> + return state ? &state[rmid] : NULL;
> }
>
> void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> @@ -200,14 +203,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_total, 0,
> - sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_local, 0,
> - sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
> + enum resctrl_event_id evt;
> + int idx;
> +
> + for_each_mbm_event(evt) {
> + idx = MBM_EVENT_IDX(evt);
> + if (!hw_dom->arch_mbm_states[idx])
> + continue;
> + memset(hw_dom->arch_mbm_states[idx], 0,
> + sizeof(struct arch_mbm_state) * r->num_rmid);
> + }
> }
>
> static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 7de4e219dba3..ef33970166af 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -346,15 +346,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> u32 rmid, enum resctrl_event_id evtid)
> {
> u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> + struct mbm_state *states;
>
> - switch (evtid) {
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &d->mbm_total[idx];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &d->mbm_local[idx];
> - default:
> + if (!resctrl_is_mbm_event(evtid))
> return NULL;
> - }
> +
> + states = d->mbm_states[MBM_EVENT_IDX(evtid)];
> +
> + return states ? &states[idx] : NULL;
> }
>
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 4a092c305255..c06752dfcb7c 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -127,12 +127,6 @@ static bool resctrl_is_mbm_enabled(void)
> resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID));
> }
>
> -static bool resctrl_is_mbm_event(int e)
> -{
> - return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> - e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> -}
> -
> /*
> * Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap
> * of free CLOSIDs.
> @@ -4019,8 +4013,10 @@ static void rdtgroup_setup_default(void)
> static void domain_destroy_mon_state(struct rdt_mon_domain *d)
> {
> bitmap_free(d->rmid_busy_llc);
> - kfree(d->mbm_total);
> - kfree(d->mbm_local);
> + for (int i = 0; i < QOS_NUM_MBM_EVENTS; i++) {
> + kfree(d->mbm_states[i]);
> + d->mbm_states[i] = NULL;
> + }
> }
>
> void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> @@ -4080,32 +4076,34 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> - size_t tsize;
> + size_t tsize = sizeof(struct mbm_state);
> + enum resctrl_event_id evt;
> + int idx;
>
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
> d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
> if (!d->rmid_busy_llc)
> return -ENOMEM;
> }
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> - tsize = sizeof(*d->mbm_total);
> - d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
> - if (!d->mbm_total) {
> - bitmap_free(d->rmid_busy_llc);
> - return -ENOMEM;
> - }
> - }
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
> - tsize = sizeof(*d->mbm_local);
> - d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
> - if (!d->mbm_local) {
> - bitmap_free(d->rmid_busy_llc);
> - kfree(d->mbm_total);
> - return -ENOMEM;
> - }
> +
> + for_each_mbm_event(evt) {
> + if (!resctrl_is_mon_event_enabled(evt))
> + continue;
> + idx = MBM_EVENT_IDX(evt);
> + d->mbm_states[idx] = kcalloc(idx_limit, tsize, GFP_KERNEL);
> + if (!d->mbm_states[idx])
> + goto cleanup;
> }
>
> return 0;
> +cleanup:
> + bitmap_free(d->rmid_busy_llc);
> + while (--idx >= 0) {
> + kfree(d->mbm_states[idx]);
> + d->mbm_states[idx] = NULL;
> + }
> +
> + return -ENOMEM;
> }
>
> int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> --
> 2.48.1
>
I tried out this change to field an urgent internal request to
experimentally get MBM and occupancy events broken down by code/data
on an MPAM implementation. Because the approach to CDP used in the
MPAM driver allocates two IDs per group, code/data in all of the
events is naturally monitored separately, but the MPAM
resctrl_arch_rmid_read() implementation adds them back together[1]
before returning.
This may not be how code/data portions of monitoring events are
reported in the long run, since it seems troubling that we would
have to add two new derived events for every additional "real" event
added later on.
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index 6b2191672eb87..9c421ad836ee1 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -94,9 +94,15 @@ enum resctrl_event_id {
QOS_L3_OCCUP_EVENT_ID = 0x01,
QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
+ QOS_L3_MBM_CODE_TOTAL_EVENT_ID,
+ QOS_L3_MBM_DATA_TOTAL_EVENT_ID,
+ QOS_L3_MBM_CODE_LOCAL_EVENT_ID,
+ QOS_L3_MBM_DATA_LOCAL_EVENT_ID,
+ QOS_L3_CODE_OCCUP_EVENT_ID,
+ QOS_L3_DATA_OCCUP_EVENT_ID,
};
But I'm at least happy to report that I didn't need to make any
substantial changes to this patch to make this experiment work. The
main difference was needing to adjust the range of MBM event IDs.
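For illustration only, the range adjustment could look something like
the sketch below. This is not the code used in the experiment: the
QOS_MBM_FIRST_EVENT_ID/QOS_MBM_LAST_EVENT_ID names are hypothetical, and
it assumes the four new MBM event IDs directly follow the existing two
so that the MBM events stay contiguous.

```c
#include <assert.h>
#include <stdbool.h>

enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID		= 0x01,
	QOS_L3_MBM_TOTAL_EVENT_ID	= 0x02,
	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
	QOS_L3_MBM_CODE_TOTAL_EVENT_ID,
	QOS_L3_MBM_DATA_TOTAL_EVENT_ID,
	QOS_L3_MBM_CODE_LOCAL_EVENT_ID,
	QOS_L3_MBM_DATA_LOCAL_EVENT_ID,
	QOS_L3_CODE_OCCUP_EVENT_ID,
	QOS_L3_DATA_OCCUP_EVENT_ID,
};

/* Hypothetical range markers: only these two need to change. */
#define QOS_MBM_FIRST_EVENT_ID	QOS_L3_MBM_TOTAL_EVENT_ID
#define QOS_MBM_LAST_EVENT_ID	QOS_L3_MBM_DATA_LOCAL_EVENT_ID

#define QOS_NUM_MBM_EVENTS	(QOS_MBM_LAST_EVENT_ID - QOS_MBM_FIRST_EVENT_ID + 1)
#define MBM_EVENT_IDX(evt)	((evt) - QOS_MBM_FIRST_EVENT_ID)

/* Same shape as the helper in the patch, with the widened range. */
static inline bool resctrl_is_mbm_event(int e)
{
	return (e >= QOS_MBM_FIRST_EVENT_ID &&
		e <= QOS_MBM_LAST_EVENT_ID);
}

#define for_each_mbm_event(evt)	\
	for (evt = QOS_MBM_FIRST_EVENT_ID; evt <= QOS_MBM_LAST_EVENT_ID; evt++)

/* The per-domain state arrays sized by this constant grow automatically. */
static_assert(QOS_NUM_MBM_EVENTS == 6, "six contiguous MBM events in this sketch");
```

Because the allocation, free and reset paths in the patch all loop with
for_each_mbm_event() or over QOS_NUM_MBM_EVENTS, they would pick up the
extra events without further edits in a layout like this one.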
Thanks!
-Peter
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/drivers/platform/arm64/mpam/mpam_resctrl.c?h=mpam/snapshot/v6.14-rc1#n402
* Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-05-09 14:46 ` Peter Newman
@ 2025-05-09 16:38 ` Luck, Tony
2025-05-09 16:43 ` Dave Martin
1 sibling, 0 replies; 72+ messages in thread
From: Luck, Tony @ 2025-05-09 16:38 UTC (permalink / raw)
To: Peter Newman
Cc: Dave Martin, Reinette Chatre, Fenghua Yu, Maciej Wieczor-Retman,
James Morse, Babu Moger, Drew Fustini, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, May 09, 2025 at 04:46:30PM +0200, Peter Newman wrote:
> Hi Dave,
>
> On Fri, May 9, 2025 at 1:29 PM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > Hi,
> >
> > (Backtrace retained for context -- see my comment at the end.)
> >
> > Cheers
> > ---Dave
> >
> > [...]
> >
> > On Thu, May 08, 2025 at 04:45:21PM -0700, Reinette Chatre wrote:
> > > Hi Tony,
> > >
> > > On 5/8/25 1:28 PM, Luck, Tony wrote:
> > > > On Thu, May 08, 2025 at 08:49:56AM -0700, Reinette Chatre wrote:
> > > >> On 4/28/25 5:33 PM, Tony Luck wrote:
> > > >>> Resctrl was written with the assumption that all monitor events
> > > >>> can be displayed as unsigned decimal integers.
> > > >>>
> > > >>> Some telemetry events provide greater precision where architecture code
> > > >>> uses a fixed point format with 18 binary places.
> > > >>>
> > > >>> Add a "display_format" field to struct mon_evt which can specify
> > > >>> that the value for the event be displayed as an integer for legacy
> > > >>> events, or as a floating point value with six decimal places converted
> > > >>> from the fixed point format received from architecture code.
> > > >>
> > > >> There was no discussion on this during the previous version.
> > > >> While this version addresses the issue of architecture changing the
> > > >> format it does not address the issue of how to handle different
> > > >> architecture formats. With this change any architecture that may
> > > >> want to support any of these events will be required to translate
> > > >> whatever format it uses into the one Intel uses to be translated
> > > >> again into format for user space. Do you think this is reasonable?
> > > >>
> > > >> Alternatively, resctrl could add additional file that contains the
> > > >> format so that if an architecture in the future needs to present data
> > > >> differently, an interface will exist to guide userspace how to parse it.
> > > >> Creation of such user interface cannot be delayed until the time
> > > >> it is needed since then these formats would be ABI.
> > > >
> > > > What if resctrl filesystem allows architecture to supply the number
> > > > of binary places for fixed point values when enabling an event?
> > >
> > > This sounds good. I do not think we are in a position to come up with
> > > an ideal solution. That would require assumptions of what another
> > > architecture may or may not do and thus we do not have complete information.
> > >
> > > >
> > > > That would allow h/w implementations to pick an appropriate precision
> > > > for each new event. Different implementations of the same event
> > > > (e.g. "core_energy") may pick different precision across architectures
> > > > or between generations of the same architecture.
> > > >
> > > > File system code can then do:
> > > >
> > > > if (binary_places == 0)
> > > > display as integer
> > > > else
> > > > convert to floating point (with one decimal place per
> > > > three binary places)
> > >
> > > I do not think this problem needs to be solved in this work but there needs
> > > to be a plan for how other architectures can be supported. When similar
> > > enabling needs to be done for that hypothetical architecture then it can
> > > be implemented ... if it is still valid based on what that architecture actually
> > > supports.
> > > It may be sufficient for the "plan" (as above) to be in comments.
> > >
> > > >
> > > > Existing events are all integers and won't change (it would be weird
> > > > for an architecture to report "mbm_local_bytes" with a fixed point
> > > > rather than integer value).
> > > >
> > > > New events may report in either integer or floating point format
> > > > with varying amounts of precision. But I'm not sure that would be
> > >
> > > Partly this will depend on the unit of measurement that should form part of
> > > the definition of the event. For example, events reporting cycles or ticks
> > > should only be integer, no?
> > >
> > > > a burden for writing tools that can run on different architectures.
> > >
> > > Maybe just a comment in the docs then ... and now I see that you did
> > > so already. My apologies, I did not get to the last four patches.
> > >
> > > Reinette
> >
> > Just a thought, but I think that while it's not possible to be fully
> > generic, a parameter model along the lines of
> >
> > quantity = raw_value * ((double)multiplier / divisor) * BASE_UNIT
> >
> > would cover most things that we have or can reasonably foresee,
> > including memory bandwidth control values.
> > raw_value, multiplier and divisor would all be integers.
> >
> > Since raw_integer can be the value used by the hardware, its precision
> > can probably be fixed at 1, though we could still report it explicitly.
> >
> > Fundamental base units would be things like "byte", "bytes per second"
> > and "none" (i.e., dimensionless quantities). (Are there others?)
The energy telemetry counters implemented in this series have:
core_energy: Units are Joules
activity: Units are Farads (Dynamic Capacitance or "CDyn")
Each of these is reported by h/w as a fixed-point binary value with 18
binary places. I'm proposing reporting these as floating point decimal
values with six decimal places (since 1/2^18 ~= 0.0000038).
Loss of precision from conversion to decimal is likely far smaller than
the error bars on the estimation of these values.
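A rough user-space sketch of that conversion follows. It is only an
illustration under stated assumptions: struct decimal6 and both helper
names are invented here, the output is always six decimal places
(rather than scaling with the number of binary places), rounding is to
nearest, and binary_places is limited to 32 so the intermediate
multiply cannot overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MICRO	1000000ULL

/* Hypothetical result type: integer part plus millionths. */
struct decimal6 {
	uint64_t whole;
	uint32_t micro;
};

/* Split a fixed-point value into whole units and rounded millionths. */
static inline struct decimal6 fixed_to_decimal6(uint64_t val,
						unsigned int binary_places)
{
	struct decimal6 d = { .whole = val, .micro = 0 };
	uint64_t frac;

	assert(binary_places <= 32);
	if (binary_places == 0)
		return d;	/* plain integer event */

	d.whole = val >> binary_places;
	frac = val & ((1ULL << binary_places) - 1);
	/* Scale the fraction to millionths, rounding to nearest. */
	frac = (frac * MICRO + (1ULL << (binary_places - 1))) >> binary_places;
	if (frac >= MICRO) {	/* rounding carried into the integer part */
		d.whole++;
		frac -= MICRO;
	}
	d.micro = (uint32_t)frac;
	return d;
}

/* Print as an integer when there are no binary places, else "N.dddddd". */
static inline void print_event_value(uint64_t val, unsigned int binary_places)
{
	struct decimal6 d = fixed_to_decimal6(val, binary_places);

	if (binary_places == 0)
		printf("%llu\n", (unsigned long long)d.whole);
	else
		printf("%llu.%06u\n", (unsigned long long)d.whole,
		       (unsigned int)d.micro);
}
```

With 18 binary places, the smallest non-zero raw value (1) comes out as
0.000004, which matches the 1/2^18 ~= 0.0000038 resolution above.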
> >
> > Since we cannot guess for certain what userspace wants to do with the
> > values, it feels better to let userspace do any scaling calculations
> > itself, rather than trying to prettify the interface.
> >
> > For example: scaling memory bandwidth percentages for MPAM is a
> > nuisance because the hardware uses fixed-point values scaled by a power
> > of 2, not by 100: the two scales can never match up anywhere except at
> > multiples of 25%, leading to irregular increments when rounded to an
> > integer percentage value and uncertainty about what the bandwidth_gran
> > parameter means. Round-trip conversions between the two
> > representations become error-prone due to repeated rounding -- this
> > proved quite fiddly to get right. Precision beyond 1% increments may
> > also be available in the hardware, but is not accessible through the
> > resctrl interface.
Sounds like my fixed-point proposal could be useful for these memory
bandwidth values.
> Google users got annoyed with these rounding errors very quickly and
> asked me to change the MBA interface to the raw, fixed-point value
> used by the MPAM register interface. (but at least shifted down, since
> the MBW_MIN/MAX fields are left-justified)
>
> >
> > For backwards compatibility we probably shouldn't change that
> > particular interface, but if we can avoid new instances of the same
> > kind of problem then that would be a benefit: i.e., explicitly tell
> > userspace how to scale a given parameter.
>
> MBA is not programmed by percentage on AMD, so I'm not sure why this
> is considered necessary for backwards compatibility.
Upcoming "region aware" Intel RDT features also present challenges for
backward compatibility as there are going to be separate counters and
controls for each region. Maintaining the use of "percentage" for
controls only gives a feeling of familiarity, while any tools that are
using the "MB:" line in the schemata file aren't going to work at all.
Sadly, we won't have a direct "MBytes/second" interface (though our
goal is to get to that some day). H/w interface values for throttling
change from legacy 0,10,20,...90 (no throttle up to max) to 255,254...1
(max bandwidth to min).
There's also the bonus feature that memory bandwidth limits can
specify min and max values so hardware can grant jobs some amount
of extra bandwidth when the system is not busy, or throttle just
the low priority jobs when approaching capacity limits.
> -Peter
-Tony
* Re: [PATCH v4 13/31] fs/resctrl: Add support for additional monitor event display formats
2025-05-09 14:46 ` Peter Newman
2025-05-09 16:38 ` Luck, Tony
@ 2025-05-09 16:43 ` Dave Martin
1 sibling, 0 replies; 72+ messages in thread
From: Dave Martin @ 2025-05-09 16:43 UTC (permalink / raw)
To: Peter Newman
Cc: Reinette Chatre, Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman,
James Morse, Babu Moger, Drew Fustini, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi,
On Fri, May 09, 2025 at 04:46:30PM +0200, Peter Newman wrote:
> Hi Dave,
>
> On Fri, May 9, 2025 at 1:29 PM Dave Martin <Dave.Martin@arm.com> wrote:
[...]
> > For example: scaling memory bandwidth percentages for MPAM is a
> > nuisance because the hardware uses fixed-point values scaled by a power
> > of 2, not by 100: the two scales can never match up anywhere except at
> > multiples of 25%, leading to irregular increments when rounded to an
> > integer percentage value and uncertainty about what the bandwidth_gran
> > parameter means. Round-trip conversions between the two
> > representations become error-prone due to repeated rounding -- this
> > proved quite fiddly to get right. Precision beyond 1% increments may
> > also be available in the hardware, but is not accessible through the
> > resctrl interface.
>
> Google users got annoyed with these rounding errors very quickly and
> asked me to change the MBA interface to the raw, fixed-point value
> used by the MPAM register interface. (but at least shifted down, since
> the MBW_MIN/MAX fields are left-justified)
That's interesting.
Do you find a need to do things like step the bandwidth allocation for
a control group? So, as part of a tuning regime, the bandwidth value
is read out, stepped to the next distinct hardware value and written
back in?
That kind of thing does not map in a convenient way onto the current
interface, although fire-and-forget programming of a predetermined
percentage works fine.
Extending my model outline, a 6-bit MPAM MBW_PART implementation might
be described by:
min: 1
max: 64
step size: 1
multiplier: 1
divisor: 64
How easy / difficult do you think it would be for userspace to work
with this, if resctrlfs were to expose the raw control (minus the
ignored bits) with that metadata?
Needless to say, the max and divisor values would depend on the
hardware and possibly other factors. They would be fixed for the
lifetime of a single resctrl instance at the very least.
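As a sketch of what userspace might do with such metadata (purely
hypothetical: struct ctrl_scale and both helpers are invented names, and
no such interface exists in resctrl today), scaling and stepping reduce
to a few lines of integer arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-control metadata, as in the outline above. */
struct ctrl_scale {
	uint32_t min;
	uint32_t max;
	uint32_t step;
	uint32_t multiplier;
	uint32_t divisor;
};

/*
 * Scale a raw control value to millionths of the base unit, i.e.
 * raw * multiplier / divisor * 1e6. Returns -1 for a raw value the
 * metadata says the hardware cannot take.
 */
static inline int64_t ctrl_raw_to_micro(const struct ctrl_scale *s, uint32_t raw)
{
	assert(s->divisor != 0 && s->step != 0);
	if (raw < s->min || raw > s->max || (raw - s->min) % s->step)
		return -1;
	return (int64_t)raw * s->multiplier * 1000000 / s->divisor;
}

/* Step to the next distinct hardware value, saturating at max. */
static inline uint32_t ctrl_step_up(const struct ctrl_scale *s, uint32_t raw)
{
	assert(s->step != 0 && s->step <= s->max);
	return (raw > s->max - s->step) ? s->max : raw + s->step;
}
```

For the 6-bit example, { .min = 1, .max = 64, .step = 1, .multiplier = 1,
.divisor = 64 } maps raw 16 to 250000 millionths (25%) and raw 64 to
1000000 (100%), with no rounding needed on the read-modify-write path.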
> > For backwards compatibility we probably shouldn't change that
> > particular interface, but if we can avoid new instances of the same
> > kind of problem then that would be a benefit: i.e., explicitly tell
> > userspace how to scale a given parameter.
>
> MBA is not programmed by percentage on AMD, so I'm not sure why this
> is considered necessary for backwards compatibility.
I presumed scripts (or pre-tuned data fed through them) are in practice
pretty platform-specific, so that it will upset people if the interface
changes between kernel versions at least on a given hardware family.
The divergence between AMD and Intel in this area is unfortunate, but
absolute and proportional bandwidth measures do not really seem to be
interchangeable -- so a truly unified interface may not be easy to
achieve either.
Having two control names in the interface might work, say:
MBP: proportion of total available memory bandwidth (%)
MBA: absolute memory bandwidth (B/s)
Then just expose the one that the hardware implements natively (while
still exposing MB as a backwards compatible alias if necessary).
Cheers
---Dave
* Re: [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events
2025-04-29 0:33 ` [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
2025-05-08 3:32 ` Reinette Chatre
@ 2025-05-10 9:58 ` Chen, Yu C
2025-05-12 14:19 ` Luck, Tony
1 sibling, 1 reply; 72+ messages in thread
From: Chen, Yu C @ 2025-05-10 9:58 UTC (permalink / raw)
To: Tony Luck
Cc: x86, linux-kernel, patches, Peter Newman, James Morse,
Dave Martin, Babu Moger, Anil Keshavamurthy, Drew Fustini,
Fenghua Yu, Maciej Wieczor-Retman, Reinette Chatre
Hi Tony,
On 4/29/2025 8:33 AM, Tony Luck wrote:
> Intel RMID based telemetry events are counted by each CPU core
> and then aggregated by one or more per-socket micro controllers.
> Enumeration support is provided by the Intel PMT subsystem.
>
> N.B. Patches for the Intel PMT system are still in progress.
> They will define an INTEL_PMT_DISCOVERY Kconfig symbol that
> will be one of the dependencies. This is commented out for
> now. Final version will include this dependency.
>
> arch/x86 selects this option based on:
>
> X86_64: Counter registers are in MMIO space. There is no readq()
> function on 32-bit. Emulation is possible with readl(), but there
> are races. Running 32-bit kernels on systems that support this
> feature seems pointless.
>
> CPU_SUP_INTEL: It is an Intel specific feature.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/Kconfig | 1 +
> drivers/platform/x86/intel/pmt/Kconfig | 7 +++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5a09acf41c8e..19107fdb4264 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -508,6 +508,7 @@ config X86_CPU_RESCTRL
> bool "x86 CPU resource control support"
> depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> depends on MISC_FILESYSTEMS
> + select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL)
Not sure if this is expected, but I got the following warning
while compiling:
WARNING: unmet direct dependencies detected for INTEL_AET_RESCTRL
Depends on [n]: X86_PLATFORM_DEVICES [=y] && INTEL_PMT_TELEMETRY [=n]
Selected by [y]:
- X86_CPU_RESCTRL [=y] && X86 [=y] && (CPU_SUP_INTEL [=y] ||
CPU_SUP_AMD [=y]) && MISC_FILESYSTEMS [=y] && X86_64 [=y] &&
CPU_SUP_INTEL [=y]
I think this is because INTEL_PMT_TELEMETRY is disabled.
Would it make sense to add INTEL_PMT_TELEMETRY to the condition
that auto-selects INTEL_AET_RESCTRL?
select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY)
thanks,
Chenyu
> select ARCH_HAS_CPU_RESCTRL
> select RESCTRL_FS
> select RESCTRL_FS_PSEUDO_LOCK
> diff --git a/drivers/platform/x86/intel/pmt/Kconfig b/drivers/platform/x86/intel/pmt/Kconfig
> index e916fc966221..3a8ce39d1004 100644
> --- a/drivers/platform/x86/intel/pmt/Kconfig
> +++ b/drivers/platform/x86/intel/pmt/Kconfig
> @@ -38,3 +38,10 @@ config INTEL_PMT_CRASHLOG
>
> To compile this driver as a module, choose M here: the module
> will be called intel_pmt_crashlog.
> +
> +config INTEL_AET_RESCTRL
> + depends on INTEL_PMT_TELEMETRY # && INTEL_PMT_DISCOVERY
> + bool
> + help
> + Architecture config should "select" this option to enable
> + support for RMID telemetry events in the resctrl file system.
* RE: [PATCH v4 05/31] fs/resctrl: Set up Kconfig options for telemetry events
2025-05-10 9:58 ` Chen, Yu C
@ 2025-05-12 14:19 ` Luck, Tony
0 siblings, 0 replies; 72+ messages in thread
From: Luck, Tony @ 2025-05-12 14:19 UTC (permalink / raw)
To: Chen, Yu C
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev, Peter Newman, James Morse, Dave Martin,
Babu Moger, Keshavamurthy, Anil S, Drew Fustini, Fenghua Yu,
Wieczor-Retman, Maciej, Chatre, Reinette
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 5a09acf41c8e..19107fdb4264 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -508,6 +508,7 @@ config X86_CPU_RESCTRL
> > bool "x86 CPU resource control support"
> > depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> > depends on MISC_FILESYSTEMS
> > + select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL)
>
> Not sure if if it is expected, I got the following warning
> during compiling:
> WARNING: unmet direct dependencies detected for INTEL_AET_RESCTRL
> Depends on [n]: X86_PLATFORM_DEVICES [=y] && INTEL_PMT_TELEMETRY [=n]
> Selected by [y]:
> - X86_CPU_RESCTRL [=y] && X86 [=y] && (CPU_SUP_INTEL [=y] ||
> CPU_SUP_AMD [=y]) && MISC_FILESYSTEMS [=y] && X86_64 [=y] &&
> CPU_SUP_INTEL [=y]
>
> I think this is because the INTEL_PMT_TELEMETRY is disabled.
> Does it make sense to add the dependency of INTEL_PMT_TELEMETRY
> to auto-select for INTEL_AET_RESCTRL?
>
> select INTEL_AET_RESCTRL if (X86_64 && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY)
I'll try to come up with a proper solution once the OOBMSM driver updates are
upstream and I don't need to provide fake interfaces. The latest version of
those patches is here [1].
Those patches add the "PMT discovery driver", which is the part that provides
the intel_pmt_get_regions_by_feature() interface. Resctrl will need to depend
on (or select) INTEL_PMT_DISCOVERY=y
-Tony
[1] https://lore.kernel.org/all/20250430212106.369208-1-david.e.box@linux.intel.com/
* Re: [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU
2025-04-29 0:33 ` [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
2025-05-08 3:54 ` Reinette Chatre
@ 2025-05-13 3:19 ` Chen, Yu C
2025-05-13 16:20 ` Luck, Tony
1 sibling, 1 reply; 72+ messages in thread
From: Chen, Yu C @ 2025-05-13 3:19 UTC (permalink / raw)
To: Tony Luck, Reinette Chatre
Cc: x86, linux-kernel, patches, Fenghua Yu, Maciej Wieczor-Retman,
Peter Newman, Babu Moger, Anil Keshavamurthy, Dave Martin,
James Morse, Drew Fustini
Hi Tony,
On 4/29/2025 8:33 AM, Tony Luck wrote:
> Resctrl file system code was built with the assumption that monitor
> events can only be read from a CPU in the cpumask_t set for each
> domain. This was true for x86 events accessed with an MSR interface,
> but may not be true for other access methods such as MMIO.
>
> Add a flag to each instance of struct mon_evt that can be set by
> architecture code to indicate there is no restriction on which
> CPU can read the event counter.
>
> Change struct mon_data and struct rmid_read to have a pointer to
> the struct mon_evt instead of the event id.
>
> Add an extra argument to resctrl_enable_mon_event() so architecture
> code can indicate which events can be read on any CPU when enabling
> the event.
>
> Bypass all the smp_call*() code for events that can be read on any CPU
> and call mon_event_count() directly from mon_event_read().
>
> Skip checks in __mon_event_count() that the read is being done from
> a CPU in the correct domain or cache scope.
>
Since __mon_event_count() was expected to run in atomic context,
smp_processor_id() did not trigger any warning before. After this
change, if evt->any_cpu is true, the telemetry counter is read directly,
with no IPI involved and in non-atomic context, so we may get a
warning like the one below:
BUG: using smp_processor_id() in preemptible [00000000] code: mount/1595
caller is __mon_event_count+0x2e/0x1e0
[ 2095.332850] Call Trace:
[ 2095.332861] <TASK>
[ 2095.332872] dump_stack_lvl+0x55/0x70
[ 2095.332887] check_preemption_disabled+0xbf/0xe0
[ 2095.332902] __mon_event_count+0x2e/0x1e0
[ 2095.332918] mon_event_count+0x2a/0xa0
[ 2095.332934] mon_add_all_files+0x202/0x270
[ 2095.332953] mkdir_mondata_subdir+0x1bf/0x1e0
[ 2095.332970] ? kcore_update_ram.isra.0+0x270/0x270
[ 2095.332985] mkdir_mondata_all+0x9d/0x100
[ 2095.333000] rdt_get_tree+0x336/0x5d0
[ 2095.333014] vfs_get_tree+0x26/0xf0
[ 2095.333028] do_new_mount+0x186/0x350
[ 2095.333044] __x64_sys_mount+0x101/0x130
[ 2095.333061] do_syscall_64+0x54/0xd70
[ 2095.333075] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Maybe avoid getting the CPU at all in __mon_event_count() if
evt->any_cpu is true?
thanks,
Chenyu
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index d9364bee486e..32385c811a92 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -358,12 +358,15 @@ static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
- int cpu = smp_processor_id();
struct rdt_l3_mon_domain *d;
struct mbm_state *m;
- int err, ret;
+ int err, ret, cpu;
u64 tval = 0;
+ /* only CPU sensitive event read cares about which CPU to read from */
+ if (!rr->evt->any_cpu)
+ cpu = smp_processor_id();
tele
* Re: [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU
2025-05-13 3:19 ` Chen, Yu C
@ 2025-05-13 16:20 ` Luck, Tony
2025-05-14 9:11 ` Chen, Yu C
0 siblings, 1 reply; 72+ messages in thread
From: Luck, Tony @ 2025-05-13 16:20 UTC (permalink / raw)
To: Chen, Yu C
Cc: Reinette Chatre, x86, linux-kernel, patches, Fenghua Yu,
Maciej Wieczor-Retman, Peter Newman, Babu Moger,
Anil Keshavamurthy, Dave Martin, James Morse, Drew Fustini
On Tue, May 13, 2025 at 11:19:23AM +0800, Chen, Yu C wrote:
Thanks for the bug report.
> get warning like below:
> BUG: using smp_processor_id() in preemptible [00000000] code: mount/1595
> caller is __mon_event_count+0x2e/0x1e0
> [ 2095.332850] Call Trace:
> [ 2095.332861] <TASK>
> [ 2095.332872] dump_stack_lvl+0x55/0x70
> [ 2095.332887] check_preemption_disabled+0xbf/0xe0
> [ 2095.332902] __mon_event_count+0x2e/0x1e0
> [ 2095.332918] mon_event_count+0x2a/0xa0
> [ 2095.332934] mon_add_all_files+0x202/0x270
> [ 2095.332953] mkdir_mondata_subdir+0x1bf/0x1e0
> [ 2095.332970] ? kcore_update_ram.isra.0+0x270/0x270
> [ 2095.332985] mkdir_mondata_all+0x9d/0x100
> [ 2095.333000] rdt_get_tree+0x336/0x5d0
> [ 2095.333014] vfs_get_tree+0x26/0xf0
> [ 2095.333028] do_new_mount+0x186/0x350
> [ 2095.333044] __x64_sys_mount+0x101/0x130
> [ 2095.333061] do_syscall_64+0x54/0xd70
> [ 2095.333075] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Hmmm. You are right, but I didn't see this. Perhaps it only shows
if CONFIG_DEBUG_PREEMPT is set?
> Maybe avoid getting the CPU at all in __mon_event_count() if
> evt->any_cpu is true?
>
> thanks,
> Chenyu
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index d9364bee486e..32385c811a92 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -358,12 +358,15 @@ static struct mbm_state *get_mbm_state(struct
> rdt_l3_mon_domain *d, u32 closid,
>
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> {
> - int cpu = smp_processor_id();
> struct rdt_l3_mon_domain *d;
> struct mbm_state *m;
> - int err, ret;
> + int err, ret, cpu;
> u64 tval = 0;
>
> + /*only CPU sensitive event read cares about which CPU to read from
> */
> + if (!rr->evt->any_cpu)
> + cpu = smp_processor_id();
>
> tele
I might fix this with a helper instead, in case some compiler doesn't
keep track of the condition and issues a "may be used before set" warning.
-Tony
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index ddfc1c5f60d6..6041cb304624 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -356,9 +356,24 @@ static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
return states ? &states[idx] : NULL;
}
+static bool cpu_on_wrong_domain(struct rmid_read *rr)
+{
+ cpumask_t *mask;
+
+ if (rr->evt->any_cpu)
+ return false;
+
+ /*
+ * When reading from a specific domain the CPU must be in that
+ * domain. Otherwise the CPU must be one that shares the cache.
+ */
+ mask = rr->d ? &rr->d->hdr.cpu_mask : &rr->ci->shared_cpu_map;
+
+ return !cpumask_test_cpu(smp_processor_id(), mask);
+}
+
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
- int cpu = smp_processor_id();
struct rdt_l3_mon_domain *d;
struct mbm_state *m;
int err, ret;
@@ -373,11 +388,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
}
if (rr->d) {
- /*
- * Unless this event can be read from any CPU, check
- * that execution is on a CPU in the domain.
- */
- if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
+ if (cpu_on_wrong_domain(rr))
return -EINVAL;
rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
rr->evt->evtid, &tval, rr->arch_mon_ctx);
@@ -389,11 +400,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
return 0;
}
- /*
- * Unless this event can be read from any CPU, check that
- * execution is on a CPU that shares the cache.
- */
- if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
+ if (cpu_on_wrong_domain(rr))
return -EINVAL;
/*
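
As a rough userspace illustration of the helper's logic (not kernel code), the sketch below models each cpumask as a 64-bit word and replaces smp_processor_id() with a stub that counts how often it is called. Every name ending in _model, as well as fake_cpu and cpu_queries, is hypothetical and exists only for this sketch; the real types and masks are the ones in the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mon_evt: only the any_cpu flag is modeled. */
struct evt_model {
	bool any_cpu;
};

/* Hypothetical stand-in for struct rmid_read. Each mask covers CPUs 0..63. */
struct read_model {
	const struct evt_model *evt;
	const uint64_t *domain_mask;	/* models rr->d->hdr.cpu_mask; NULL if no domain */
	const uint64_t *shared_mask;	/* models rr->ci->shared_cpu_map */
};

static int fake_cpu;	/* CPU number the stub reports */
static int cpu_queries;	/* how many times the stub was called */

/* Stub standing in for smp_processor_id(); counts its callers. */
static int current_cpu_model(void)
{
	cpu_queries++;
	return fake_cpu;
}

/* Mirrors the shape of cpu_on_wrong_domain() from the diff above. */
static bool cpu_on_wrong_domain_model(const struct read_model *rr)
{
	const uint64_t *mask;
	int cpu;

	/* Events readable from any CPU never ask which CPU is running. */
	if (rr->evt->any_cpu)
		return false;

	/* Specific domain: use its mask. Otherwise use the shared-cache mask. */
	mask = rr->domain_mask ? rr->domain_mask : rr->shared_mask;

	cpu = current_cpu_model();
	assert(cpu >= 0 && cpu < 64);

	return !((*mask >> cpu) & 1);
}
```

With any_cpu set, the function returns false without ever calling the stub, which is the property that avoids the preemptible smp_processor_id() warning. Otherwise the stub is called once and the result is tested against the domain mask, or against the shared mask when no domain is given.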
* Re: [PATCH v4 12/31] fs/resctrl: Improve handling for events that can be read from any CPU
2025-05-13 16:20 ` Luck, Tony
@ 2025-05-14 9:11 ` Chen, Yu C
0 siblings, 0 replies; 72+ messages in thread
From: Chen, Yu C @ 2025-05-14 9:11 UTC (permalink / raw)
To: Luck, Tony
Cc: Reinette Chatre, x86, linux-kernel, patches, Fenghua Yu,
Maciej Wieczor-Retman, Peter Newman, Babu Moger,
Anil Keshavamurthy, Dave Martin, James Morse, Drew Fustini
On 5/14/2025 12:20 AM, Luck, Tony wrote:
> On Tue, May 13, 2025 at 11:19:23AM +0800, Chen, Yu C wrote:
>
> Thanks for the bug report.
>
>> get warning like below:
>> BUG: using smp_processor_id() in preemptible [00000000] code: mount/1595
>> caller is __mon_event_count+0x2e/0x1e0
>> [ 2095.332850] Call Trace:
>> [ 2095.332861] <TASK>
>> [ 2095.332872] dump_stack_lvl+0x55/0x70
>> [ 2095.332887] check_preemption_disabled+0xbf/0xe0
>> [ 2095.332902] __mon_event_count+0x2e/0x1e0
>> [ 2095.332918] mon_event_count+0x2a/0xa0
>> [ 2095.332934] mon_add_all_files+0x202/0x270
>> [ 2095.332953] mkdir_mondata_subdir+0x1bf/0x1e0
>> [ 2095.332970] ? kcore_update_ram.isra.0+0x270/0x270
>> [ 2095.332985] mkdir_mondata_all+0x9d/0x100
>> [ 2095.333000] rdt_get_tree+0x336/0x5d0
>> [ 2095.333014] vfs_get_tree+0x26/0xf0
>> [ 2095.333028] do_new_mount+0x186/0x350
>> [ 2095.333044] __x64_sys_mount+0x101/0x130
>> [ 2095.333061] do_syscall_64+0x54/0xd70
>> [ 2095.333075] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Hmmm. You are right, but I didn't see this. Perhaps it only shows
> if CONFIG_DEBUG_PREEMPT is set?
>
Yes, CONFIG_DEBUG_PREEMPT checks that.
>> Maybe avoid getting the CPU at all in __mon_event_count() if
>> evt->any_cpu is true?
>>
>> thanks,
>> Chenyu
>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
>> index d9364bee486e..32385c811a92 100644
>> --- a/fs/resctrl/monitor.c
>> +++ b/fs/resctrl/monitor.c
>> @@ -358,12 +358,15 @@ static struct mbm_state *get_mbm_state(struct
>> rdt_l3_mon_domain *d, u32 closid,
>>
>> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
>> {
>> - int cpu = smp_processor_id();
>> struct rdt_l3_mon_domain *d;
>> struct mbm_state *m;
>> - int err, ret;
>> + int err, ret, cpu;
>> u64 tval = 0;
>>
>> + /*only CPU sensitive event read cares about which CPU to read from
>> */
>> + if (!rr->evt->any_cpu)
>> + cpu = smp_processor_id();
>>
>> tele
>
> I might fix with a helper just in case some compiler doesn't keep track
> and issues a "may be used before set" warning.
>
The fix below looks good to me. I'll apply it to our internal tree
and continue testing.
thanks,
Chenyu
> -Tony
>
>
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index ddfc1c5f60d6..6041cb304624 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -356,9 +356,24 @@ static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
> return states ? &states[idx] : NULL;
> }
>
> +static bool cpu_on_wrong_domain(struct rmid_read *rr)
> +{
> + cpumask_t *mask;
> +
> + if (rr->evt->any_cpu)
> + return false;
> +
> + /*
> + * When reading from a specific domain the CPU must be in that
> + * domain. Otherwise the CPU must be one that shares the cache.
> + */
> + mask = rr->d ? &rr->d->hdr.cpu_mask : &rr->ci->shared_cpu_map;
> +
> + return !cpumask_test_cpu(smp_processor_id(), mask);
> +}
> +
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> {
> - int cpu = smp_processor_id();
> struct rdt_l3_mon_domain *d;
> struct mbm_state *m;
> int err, ret;
> @@ -373,11 +388,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> }
>
> if (rr->d) {
> - /*
> - * Unless this event can be read from any CPU, check
> - * that execution is on a CPU in the domain.
> - */
> - if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
> + if (cpu_on_wrong_domain(rr))
> return -EINVAL;
> rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
> rr->evt->evtid, &tval, rr->arch_mon_ctx);
> @@ -389,11 +400,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> return 0;
> }
>
> - /*
> - * Unless this event can be read from any CPU, check that
> - * execution is on a CPU that shares the cache.
> - */
> - if (!rr->evt->any_cpu && !cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
> + if (cpu_on_wrong_domain(rr))
> return -EINVAL;
>
> /*