* [PATCH v5 00/29] x86/resctrl telemetry monitoring
@ 2025-05-21 22:50 Tony Luck
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
These patches are based on tip x86/cache branch. HEAD at time of
snapshot is:
54d14f25664b ("MAINTAINERS: Add reviewers for fs/resctrl")
These patches are also available in the rdt-aet-v5 branch at:
Link: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
Changes (based on feedback from Reinette and a bug report from Chen Yu)
since v4 was posted here:
Link: https://lore.kernel.org/all/20250429003359.375508-1-tony.luck@intel.com/
Change map is indexed by the patch numbers in v4. Some patches have been merged,
split, dropped, or re-ordered. The v5 patches are referred to
by their 4-digit git format-patch numbers in an attempt to avoid
confusion.
=== 1 ===
The v4 patch was focused on removing the rdt_mon_features bitmap,
which included moving all the mon_evt structure definitions
into an array. Reinette noted that this array would mean the
rdt_resource::evt_list is no longer needed.
v5 splits up the changes into three parts:
0001: Moves mon_evt structures into an array (now named
mon_event_all[]) and replaces use of rdt_resource::evt_list
with iteration of enabled events in the array.
0002: Replace resctrl_arch_is_llc_occupancy_enabled() with
resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)
(ditto for the mbm*enabled() inline functions)
0003: Remove remaining use of rdt_mon_features.
=== 2 ===
0004: Fix typos. Change parameter of resctrl_is_mbm_event() to an enum.
s/QOS_NUM_MBM_EVENTS/QOS_NUM_L3_MBM_EVENTS/.
s/MBM_EVENT_IDX/MBM_STATE_IDX/. Rewrite get_arch_mbm_state()
in the same simple style as get_mbm_state().
=== 3 ===
Dropped this patch. No immediate plans for other mbm monitor events
that could be used as input to the "mba_MBps" feedback control.
=== 4 ===
Also dropped. The rdt_resource::evt_list no longer exists, so no need
to rearrange code to build it at mount time.
=== 5 ===
Dropped the Kconfig changes for now. This means that intel_aet.c is
always built with CONFIG_X86_CPU_RESCTRL=y. Will need to revisit when
the CONFIG_INTEL_PMT_DISCOVERY driver is upstream.
=== 6 ===
0005: Added comments noting that the fake interface is deliberately crafted
with parameters that exercise multiple aggregators per package
and a group that supports too few RMIDs.
=== 7 ===
0006: Rename check_domain_header() to domain_header_is_valid()
=== 8 ===
Split into two parts:
0007: Better names for functions. Use "()" consistently in
commit message when naming functions.
0008: Better description that this change is just for domain add.
=== 9 ===
0009: New commit message with background and rationale for change.
=== 10 ===
0010: More context in commit message. Dropped an unnecessary container_of().
Made domain_add_cpu_ctrl() match domain_add_cpu_mon() with a simple
path when adding a CPU to an existing domain.
=== 11 ===
0011: Added rationale for the rename of the rdt_mon_domain and rdt_hw_mon_domain
structures. Fixed alignment in structure definitions. Fixed broken
fir tree ordering.
=== 12 ===
Split into two parts:
0012: Make mon_data and rmid_read structures point to mon_evt instead of
just holding the event enum.
0013: The "read from any CPU" part.
Fixed bug reported by Chen Yu for use of smp_processor_id().
New shortlog description. Fixed "cpumast" typo. Separated
problem description from solution.
Fixed reverse fir tree.
Avoid "usually" comment (new comments in the helper function
that moved out of __mon_event_count()).
=== 13 ===
0014: New direction. Don't bind specific value display formats
to specific events, which would limit other architectures to following
in the footsteps of the first to implement an event. Instead
allow the architecture to specify how many binary fixed-point bits
are used for each event.
=== 14 ===
0015: Add period to end of sentence in comment for resctrl_arch_pre_mount().
Use atomic_try_cmpxchg() instead of atomic_cmpxchg().
=== 15 ===
0016: Updated commit comment to avoid "with code".
Dropped initialization of rdt_hw_resource::rid.
=== 16 ===
=== 17 ===
=== 18 ===
=== 20 ===
=== 21 ===
These were "first part", "second part" ... of enumeration and
the sanity check for adequate MMIO space to match expectations
from the XML file layout description.
0017:
0018:
0019:
Now describe what actions each part is doing. Building the
struct event_group fields as needed for each patch.
Split the fields into sets that are initialized from XML
files, and fields used by resctrl code to manage groups.
Fixed Link: lines with real URL to the Intel-PMT git repo.
Changed type of guid from int to u32.
Changed configure_events() return value from bool to standard
integer error code (and use -ENOMEM, -EINVAL where appropriate).
Document mmio_info structure and add an ascii art picture to the
commit comment showing how it is used.
Use kzalloc() instead of kmalloc().
Add a helper function skip_this_region() so that counting
regions and allocation for regions will do the same thing.
=== 19 ===
0020: Add description of layout of MMIO counters to commit comment.
=== 22 ===
0021: Fixed MMIO address range check in intel_aet_read_event().
Changed return code from -EINVAL to -EIO to meet expectations
of rdtgroup_mondata_show().
Changed name of the VALID_BIT define to DATA_VALID to indicate that
it shows that the value in a counter is valid (as opposed to the
counter itself).
Added a check in resctrl_arch_rmid_read() that the remainder of the
function (after the check for RDT_RESOURCE_PERF_PKG) has been
passed an RDT_RESOURCE_L3 resource.
=== 23 ===
0022: Fix typo s/domsins/domains/
Kept definition of struct rdt_perf_pkg_mon_domain in architecture
code. Reinette commented "This may thus be ok like this for now".
Since this only contains the rdt_domain_hdr, there isn't anything
extra that file system code could look at even if it somehow
wanted to.
Things may be different for a more complex resource that has
to maintain additional per-domain state that file system code
may need to be aware of.
=== 24 ===
I merged old patch 24 into new patch 0016.
=== 25 ===
0023: Unchanged
=== 26 ===
0024: Split the hard-to-read rdt_check_option() function into two that
have names that convey what they do: rdt_is_option_force_enabled()
and rdt_is_option_force_disabled().
=== 27 ===
0025: Updated commit comment and kerneldoc comment to note that
pmt_event:num_rmids field is initialized from data in the
XML file, but may be overwritten.
Added min() operation to make sure num_rmids cannot be increased
when processing additional event groups.
=== 28 ===
0026: In v4 this patch added to the mount-time initialization of the per-resource
event lists, but that code has been dropped from this series. Added a
new one-time call in rdt_get_tree() (inside code where resctrl_mutex
is held). Same basic function to compute the number of RMIDs as the
minimum across all enabled monitor resources.
=== 29 ===
=== 30 ===
0027:
0028: v4 presented these as an RFC to add a debug info file for a resource. But
after some thought I changed strategy so that the per-resource function
can choose the name. This also avoids it showing up as an empty file in
the info directory for other resources.
=== 31 ===
0029: No changes.
Background
----------
Telemetry features are being implemented in conjunction with the
IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
counts for various events to a collector in a nearby OOBMSM device to be
accumulated with counts for each <RMID, event> pair received from other
CPUs. Cores send event counts when the RMID value changes, or after every
2ms of elapsed time.
Each OOBMSM device may implement multiple event collectors with each
servicing a subset of the logical CPUs on a package. In the initial
hardware implementation, there are two categories of events: energy
and perf.
1) Energy - Two counters
core_energy: This is an estimate of Joules consumed by each core. It is
calculated based on the types of instructions executed, not from a power
meter. This counter is useful to understand how much energy a workload
is consuming.
activity: This measures "accumulated dynamic capacitance". Users who
want to optimize energy consumption for a workload may use this rather
than core_energy because it provides consistent results independent of
any frequency or voltage changes that may occur during the runtime of
the application (e.g. entry/exit from turbo mode).
2) Performance - Seven counters
These are similar events to those available via the Linux "perf" tool,
but collected in a way with much lower overhead (no need to collect data
on every context switch).
stalls_llc_hit - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which hit in the LLC
c1_res - Counts the total C1 residency across all cores. The underlying
counter increments on 100MHz clock ticks
unhalted_core_cycles - Counts the total number of unhalted core clock
cycles
stalls_llc_miss - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which missed all the
local caches
c6_res - Counts the total C6 residency. The underlying counter increments
on crystal clock (25MHz) ticks
unhalted_ref_cycles - Counts the total number of unhalted reference clock
(TSC) cycles
uops_retired - Counts the total number of uops retired
The counters are arranged in groups in MMIO space of the OOBMSM device.
E.g. for the energy counters the layout is:
Offset: Counter
0x00 core energy for RMID 0
0x08 core activity for RMID 0
0x10 core energy for RMID 1
0x18 core activity for RMID 1
...
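As an illustration only (not code from this series; the helper name is
made up), under this layout the byte offset of the counter for a given
<RMID, event> pair could be computed as below, assuming 8 bytes per
counter and one record of num_events counters per RMID:

	/*
	 * Sketch: offset of the counter for <rmid, event_index> when all
	 * counters for RMID 0 come first, then all counters for RMID 1,
	 * and so on. num_events would be 2 for the energy group and 7 for
	 * the perf group.
	 */
	static inline size_t counter_offset(unsigned int num_events,
					    unsigned int rmid,
					    unsigned int event_index)
	{
		return ((size_t)rmid * num_events + event_index) * 8;
	}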
Enumeration
-----------
The only CPUID based enumeration for this feature is the legacy
CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
IA32_PQR_ASSOC MSR and the RMID field within it.
The OOBMSM driver discovers which features are present via
PCIe VSEC capabilities. Each feature is tagged with a unique
identifier. These identifiers indicate which XML description file from
https://github.com/intel/Intel-PMT describes which event counters are
available and their layout within the MMIO BAR space of the OOBMSM device.
Resctrl User Interface
----------------------
Because there may be multiple OOBMSM collection agents per processor
package, resctrl accumulates event counts from all agents on a package
and presents a single value to users. This will provide a consistent
user interface on future platforms that vary the number of collectors,
or the mappings from logical CPUs to collectors.
Users will continue to see the legacy monitoring files in the "L3"
directories and the telemetry files in the new "PERF_PKG" directories
(with each file providing the aggregated value from all OOBMSM collectors
on that package).
$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
├── mon_L3_00
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_L3_01
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_PERF_PKG_00
│ ├── activity
│ ├── c1_res
│ ├── c6_res
│ ├── core_energy
│ ├── stalls_llc_hit
│ ├── stalls_llc_miss
│ ├── unhalted_core_cycles
│ ├── unhalted_ref_cycles
│ └── uops_retired
└── mon_PERF_PKG_01
├── activity
├── c1_res
├── c6_res
├── core_energy
├── stalls_llc_hit
├── stalls_llc_miss
├── unhalted_core_cycles
├── unhalted_ref_cycles
└── uops_retired
Resctrl Implementation
----------------------
The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
that returns an array of structures describing the per-RMID groups it
found from the VSEC enumeration. Linux looks at the unique identifiers
for each group and enables resctrl for all groups with known unique
identifiers.
The memory map for the counters for each <RMID, event> pair is described
by the XML file. This is too unwieldy to use in the Linux kernel, so a
simplified representation is built into the resctrl code. Note that the
counters are in MMIO space instead of being accessed using the IA32_QM_EVTSEL
and IA32_QM_CTR MSRs. This means there is no need for cross-processor
calls to read counters from a CPU in a specific domain. The counters
can be read from any CPU.
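For illustration, here is a minimal sketch (not the code from this
series; the function name is made up) of walking the structures returned
by intel_pmt_get_regions_by_feature(), using the types shown in the fake
interface patch later in this series:

	#include <linux/err.h>
	#include <linux/printk.h>

	/* Sketch only: log each per-RMID energy telemetry region. */
	static int example_list_regions(void)
	{
		struct pmt_feature_group *fg;
		int i;

		fg = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
		if (IS_ERR(fg))
			return PTR_ERR(fg);

		for (i = 0; i < fg->count; i++) {
			struct telemetry_region *tr = &fg->regions[i];

			pr_info("guid %#x package %u: %u RMIDs, %zu bytes of counters\n",
				tr->guid, tr->plat_info.package_id,
				tr->num_rmids, tr->size);
		}

		intel_pmt_put_feature_group(fg);
		return 0;
	}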
High level description of code changes:
1) New scope RESCTRL_PACKAGE
2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
	switch (r->rid) {
	case RDT_RESOURCE_L3:
		/* helper for L3 */
		break;
	case RDT_RESOURCE_PERF_PKG:
		/* helper for PKG */
		break;
	}
4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.
With only one platform providing this feature, it's tricky to tell
exactly where it is going to go. I've made the event definitions
platform specific (based on the unique ID from the VSEC enumeration). It
seems possible/likely that the list of events may change from generation
to generation.
I've picked names for events based on the descriptions in the XML file.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tony Luck (29):
x86,fs/resctrl: Consolidate monitor event descriptions
x86,fs/resctrl: Replace architecture event enabled checks
x86/resctrl: Remove 'rdt_mon_features' global variable
x86,fs/resctrl: Prepare for more monitor events
x86/rectrl: Fake OOBMSM interface
x86,fs/resctrl: Improve domain type checking
x86,fs/resctrl: Rename some L3 specific functions
x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain
types
x86/resctrl: Change generic domain functions to use struct
rdt_domain_hdr
x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
fs/resctrl: Make event details accessible to functions when reading
events
x86,fs/resctrl: Handle events that can be read from any CPU
x86,fs/resctrl: Support binary fixed point event counters
fs/resctrl: Add an architectural hook called for each mount
x86/resctrl: Add and initialize rdt_resource for package scope core
monitor
x86/resctrl: Discover hardware telemetry events
x86/resctrl: Count valid telemetry aggregators per package
x86/resctrl: Complete telemetry event enumeration
x86,fs/resctrl: Fill in details of Clearwater Forest events
x86/resctrl: Read core telemetry events
x86,fs/resctrl: Handle domain creation/deletion for
RDT_RESOURCE_PERF_PKG
x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
x86/resctrl: Add energy/perf choices to rdt boot option
x86/resctrl: Handle number of RMIDs supported by telemetry resources
x86,fs/resctrl: Move RMID initialization to first mount
fs/resctrl: Add file system mechanism for architecture info file
x86/resctrl: Add info/PERF_PKG_MON/status file
x86/resctrl: Update Documentation for package events
.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/filesystems/resctrl.rst | 53 ++-
include/linux/resctrl.h | 97 ++++-
include/linux/resctrl_types.h | 14 +
arch/x86/include/asm/resctrl.h | 16 -
.../cpu/resctrl/fake_intel_aet_features.h | 73 ++++
arch/x86/kernel/cpu/resctrl/internal.h | 30 +-
fs/resctrl/internal.h | 87 ++--
arch/x86/kernel/cpu/resctrl/core.c | 314 ++++++++++----
.../cpu/resctrl/fake_intel_aet_features.c | 97 +++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 388 ++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 67 +--
fs/resctrl/ctrlmondata.c | 108 ++++-
fs/resctrl/monitor.c | 262 +++++++-----
fs/resctrl/rdtgroup.c | 259 ++++++++----
arch/x86/Kconfig | 2 +-
arch/x86/kernel/cpu/resctrl/Makefile | 2 +
17 files changed, 1439 insertions(+), 432 deletions(-)
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
base-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
base-branch: x86/cache
base-commit: 54d14f25664bbb75c2928dd0d64a095c0f488176
--
2.49.0
* [PATCH v5 01/29] x86,fs/resctrl: Consolidate monitor event descriptions
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There are currently only three monitor events, all associated with
the RDT_RESOURCE_L3 resource. Growing support for additional events
will be easier with some restructuring to have a single point in
file system code where all attributes of all events are defined.
Place all event descriptions into an array mon_event_all[]. Doing
this has the beneficial side effect of removing the need for
rdt_resource::evt_list.
Drop the code that builds evt_list and change the two places where
the list is scanned to scan mon_event_all[] instead.
Architecture code now informs file system code which events are
available with resctrl_enable_mon_event().
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 4 +-
fs/resctrl/internal.h | 10 +++--
arch/x86/kernel/cpu/resctrl/core.c | 12 ++++--
fs/resctrl/monitor.c | 63 +++++++++++++++---------------
fs/resctrl/rdtgroup.c | 11 +++---
5 files changed, 55 insertions(+), 45 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 9ba771f2ddea..014cc6fe4a9b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -269,7 +269,6 @@ enum resctrl_schema_fmt {
* @mon_domains: RCU list of all monitor domains for this resource
* @name: Name to use in "schemata" file.
* @schema_fmt: Which format string and parser is used for this schema.
- * @evt_list: List of monitoring events
* @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth
* monitoring events can be configured.
* @cdp_capable: Is the CDP feature available on this resource
@@ -287,7 +286,6 @@ struct rdt_resource {
struct list_head mon_domains;
char *name;
enum resctrl_schema_fmt schema_fmt;
- struct list_head evt_list;
unsigned int mbm_cfg_mask;
bool cdp_capable;
};
@@ -372,6 +370,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
+void resctrl_enable_mon_event(enum resctrl_event_id evtid);
+
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
/**
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 9a8cf6f11151..94e635656261 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -52,19 +52,23 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
}
/**
- * struct mon_evt - Entry in the event list of a resource
+ * struct mon_evt - Description of a monitor event
* @evtid: event id
+ * @rid: index of the resource for this event
* @name: name of the event
* @configurable: true if the event is configurable
- * @list: entry in &rdt_resource->evt_list
+ * @enabled: true if the event is enabled
*/
struct mon_evt {
enum resctrl_event_id evtid;
+ enum resctrl_res_level rid;
char *name;
bool configurable;
- struct list_head list;
+ bool enabled;
};
+extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
+
/**
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 224bed28f341..3d74c2d3dcea 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -861,12 +861,18 @@ static __init bool get_rdt_mon_resources(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
+ if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
- if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL))
+ }
+ if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
- if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
+ }
+ if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
+ }
if (!rdt_mon_features)
return false;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index bde2801289d3..31c81d703ff4 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -842,38 +842,39 @@ static void dom_data_exit(struct rdt_resource *r)
mutex_unlock(&rdtgroup_mutex);
}
-static struct mon_evt llc_occupancy_event = {
- .name = "llc_occupancy",
- .evtid = QOS_L3_OCCUP_EVENT_ID,
-};
-
-static struct mon_evt mbm_total_event = {
- .name = "mbm_total_bytes",
- .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
-};
-
-static struct mon_evt mbm_local_event = {
- .name = "mbm_local_bytes",
- .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
-};
-
/*
- * Initialize the event list for the resource.
- *
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
+ * All available events. Architecture code marks the ones that
+ * are supported by a system using resctrl_enable_mon_event()
+ * to set .enabled.
*/
-static void l3_mon_evt_init(struct rdt_resource *r)
+struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
+ [QOS_L3_OCCUP_EVENT_ID] = {
+ .name = "llc_occupancy",
+ .evtid = QOS_L3_OCCUP_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
+ },
+ [QOS_L3_MBM_TOTAL_EVENT_ID] = {
+ .name = "mbm_total_bytes",
+ .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
+ },
+ [QOS_L3_MBM_LOCAL_EVENT_ID] = {
+ .name = "mbm_local_bytes",
+ .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
+ .rid = RDT_RESOURCE_L3,
+ },
+};
+
+void resctrl_enable_mon_event(enum resctrl_event_id evtid)
{
- INIT_LIST_HEAD(&r->evt_list);
+ if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
+ return;
+ if (mon_event_all[evtid].enabled) {
+ pr_warn("Duplicate enable for event %d\n", evtid);
+ return;
+ }
- if (resctrl_arch_is_llc_occupancy_enabled())
- list_add_tail(&llc_occupancy_event.list, &r->evt_list);
- if (resctrl_arch_is_mbm_total_enabled())
- list_add_tail(&mbm_total_event.list, &r->evt_list);
- if (resctrl_arch_is_mbm_local_enabled())
- list_add_tail(&mbm_local_event.list, &r->evt_list);
+ mon_event_all[evtid].enabled = true;
}
/**
@@ -900,15 +901,13 @@ int resctrl_mon_resource_init(void)
if (ret)
return ret;
- l3_mon_evt_init(r);
-
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
- mbm_total_event.configurable = true;
+ mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) {
- mbm_local_event.configurable = true;
+ mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_local_bytes_config",
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index cc37f58b47dd..69e0d40c4449 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1150,7 +1150,9 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
struct mon_evt *mevt;
- list_for_each_entry(mevt, &r->evt_list, list) {
+ for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
+ if (mevt->rid != r->rid || !mevt->enabled)
+ continue;
seq_printf(seq, "%s\n", mevt->name);
if (mevt->configurable)
seq_printf(seq, "%s_config\n", mevt->name);
@@ -3055,10 +3057,9 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
struct mon_evt *mevt;
int ret, domid;
- if (WARN_ON(list_empty(&r->evt_list)))
- return -EPERM;
-
- list_for_each_entry(mevt, &r->evt_list, list) {
+ for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
+ if (mevt->rid != r->rid || !mevt->enabled)
+ continue;
domid = do_sum ? d->ci->id : d->hdr.id;
priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
if (WARN_ON_ONCE(!priv))
--
2.49.0
* [PATCH v5 02/29] x86,fs/resctrl: Replace architecture event enabled checks
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The resctrl file system now has complete knowledge of the status
of every event, so there is no need for per-event function calls
to check it.
Replace each of the resctrl_arch_is_{event}enabled() calls with
resctrl_is_mon_event_enabled(QOS_{EVENT}).
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 ++
arch/x86/include/asm/resctrl.h | 15 ---------------
arch/x86/kernel/cpu/resctrl/core.c | 4 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
fs/resctrl/ctrlmondata.c | 4 ++--
fs/resctrl/monitor.c | 15 ++++++++++-----
fs/resctrl/rdtgroup.c | 18 +++++++++---------
7 files changed, 27 insertions(+), 35 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 014cc6fe4a9b..843ad7c8e247 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -372,6 +372,8 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
void resctrl_enable_mon_event(enum resctrl_event_id evtid);
+bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
+
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
/**
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index ad497ab196d1..9c889f51b7f1 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -82,21 +82,6 @@ static inline void resctrl_arch_disable_mon(void)
static_branch_dec_cpuslocked(&rdt_enable_key);
}
-static inline bool resctrl_arch_is_llc_occupancy_enabled(void)
-{
- return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID));
-}
-
-static inline bool resctrl_arch_is_mbm_total_enabled(void)
-{
- return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID));
-}
-
-static inline bool resctrl_arch_is_mbm_local_enabled(void)
-{
- return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID));
-}
-
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 3d74c2d3dcea..f4f4c1d42710 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -401,13 +401,13 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
size_t tsize;
- if (resctrl_arch_is_mbm_total_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
tsize = sizeof(*hw_dom->arch_mbm_total);
hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
if (!hw_dom->arch_mbm_total)
return -ENOMEM;
}
- if (resctrl_arch_is_mbm_local_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
tsize = sizeof(*hw_dom->arch_mbm_local);
hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
if (!hw_dom->arch_mbm_local) {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3fc4d9f56f0d..a1296ee7d508 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -206,11 +206,11 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
- if (resctrl_arch_is_mbm_total_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
memset(hw_dom->arch_mbm_total, 0,
sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
memset(hw_dom->arch_mbm_local, 0,
sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
}
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 6ed2dfd4dbbd..6be423c5e2e0 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -473,12 +473,12 @@ ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
rdt_last_cmd_clear();
if (!strcmp(buf, "mbm_local_bytes")) {
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
else
ret = -EINVAL;
} else if (!strcmp(buf, "mbm_total_bytes")) {
- if (resctrl_arch_is_mbm_total_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
else
ret = -EINVAL;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 31c81d703ff4..325e23c1a403 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -336,7 +336,7 @@ void free_rmid(u32 closid, u32 rmid)
entry = __rmid_entry(idx);
- if (resctrl_arch_is_llc_occupancy_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
add_rmid_to_limbo(entry);
else
list_add_tail(&entry->list, &rmid_free_lru);
@@ -635,10 +635,10 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
* This is protected from concurrent reads from user as both
* the user and overflow handler hold the global mutex.
*/
- if (resctrl_arch_is_mbm_total_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_TOTAL_EVENT_ID);
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_LOCAL_EVENT_ID);
}
@@ -877,6 +877,11 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid)
mon_event_all[evtid].enabled = true;
}
+bool resctrl_is_mon_event_enabled(enum resctrl_event_id evtid)
+{
+ return evtid < QOS_NUM_EVENTS && mon_event_all[evtid].enabled;
+}
+
/**
* resctrl_mon_resource_init() - Initialise global monitoring structures.
*
@@ -912,9 +917,9 @@ int resctrl_mon_resource_init(void)
RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
}
- if (resctrl_arch_is_mbm_local_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
mba_mbps_default_event = QOS_L3_MBM_LOCAL_EVENT_ID;
- else if (resctrl_arch_is_mbm_total_enabled())
+ else if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
return 0;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 69e0d40c4449..80e74940281a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -123,8 +123,8 @@ void rdt_staged_configs_clear(void)
static bool resctrl_is_mbm_enabled(void)
{
- return (resctrl_arch_is_mbm_total_enabled() ||
- resctrl_arch_is_mbm_local_enabled());
+ return (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID) ||
+ resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID));
}
static bool resctrl_is_mbm_event(int e)
@@ -196,7 +196,7 @@ static int closid_alloc(void)
lockdep_assert_held(&rdtgroup_mutex);
if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID) &&
- resctrl_arch_is_llc_occupancy_enabled()) {
+ resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
cleanest_closid = resctrl_find_cleanest_closid();
if (cleanest_closid < 0)
return cleanest_closid;
@@ -4047,7 +4047,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
- if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
/*
* When a package is going down, forcefully
* decrement rmid->ebusy. There is no way to know
@@ -4083,12 +4083,12 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize;
- if (resctrl_arch_is_llc_occupancy_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
- if (resctrl_arch_is_mbm_total_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
tsize = sizeof(*d->mbm_total);
d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_total) {
@@ -4096,7 +4096,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain
return -ENOMEM;
}
}
- if (resctrl_arch_is_mbm_local_enabled()) {
+ if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
tsize = sizeof(*d->mbm_local);
d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
if (!d->mbm_local) {
@@ -4141,7 +4141,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
RESCTRL_PICK_ANY_CPU);
}
- if (resctrl_arch_is_llc_occupancy_enabled())
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
/*
@@ -4216,7 +4216,7 @@ void resctrl_offline_cpu(unsigned int cpu)
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0, cpu);
}
- if (resctrl_arch_is_llc_occupancy_enabled() &&
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
cancel_delayed_work(&d->cqm_limbo);
cqm_setup_limbo_handler(d, 0, cpu);
--
2.49.0
* [PATCH v5 03/29] x86/resctrl: Remove 'rdt_mon_features' global variable
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
This variable was used as a bitmask of enabled monitor events. But
that function is now provided by the filesystem mon_event_all[] array.
Remove the remaining uses of this variable.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/include/asm/resctrl.h | 1 -
arch/x86/kernel/cpu/resctrl/core.c | 9 +++++----
arch/x86/kernel/cpu/resctrl/monitor.c | 5 -----
3 files changed, 5 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 9c889f51b7f1..089742970cc1 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -42,7 +42,6 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
extern bool rdt_alloc_capable;
extern bool rdt_mon_capable;
-extern unsigned int rdt_mon_features;
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f4f4c1d42710..819bc7a09327 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -860,21 +860,22 @@ static __init bool get_rdt_alloc_resources(void)
static __init bool get_rdt_mon_resources(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
- rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
+ ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
- rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
+ ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
- rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
+ ret = true;
}
- if (!rdt_mon_features)
+ if (!ret)
return false;
return !rdt_get_mon_l3_config(r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a1296ee7d508..fda579251dba 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -30,11 +30,6 @@
*/
bool rdt_mon_capable;
-/*
- * Global to indicate which monitoring events are enabled.
- */
-unsigned int rdt_mon_features;
-
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
static int snc_nodes_per_l3_cache = 1;
--
2.49.0
* [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There's a rule in computer programming that objects appear zero,
once, or many times. So code accordingly.
There are two MBM events and resctrl is coded with a lot of
	if (local)
		do one thing
	if (total)
		do a different thing
Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold arrays
of pointers to per event data instead of explicit fields for total and
local bandwidth.
Simplify by coding for many events, using loops over those that are enabled.
Move resctrl_is_mbm_event() to <linux/resctrl.h> so it can be used more
widely. Also provide a for_each_mbm_event() helper macro.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 15 +++++---
include/linux/resctrl_types.h | 3 ++
arch/x86/kernel/cpu/resctrl/internal.h | 6 ++--
arch/x86/kernel/cpu/resctrl/core.c | 38 ++++++++++----------
arch/x86/kernel/cpu/resctrl/monitor.c | 36 +++++++++----------
fs/resctrl/monitor.c | 13 ++++---
fs/resctrl/rdtgroup.c | 48 ++++++++++++--------------
7 files changed, 82 insertions(+), 77 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 843ad7c8e247..40f2d0d48d02 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -161,8 +161,7 @@ struct rdt_ctrl_domain {
* @hdr: common header for different domain types
* @ci: cache info for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
- * @mbm_total: saved state for MBM total bandwidth
- * @mbm_local: saved state for MBM local bandwidth
+ * @mbm_states: saved state for each QOS MBM event
* @mbm_over: worker to periodically read MBM h/w counters
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu: worker CPU for MBM h/w counters
@@ -172,8 +171,7 @@ struct rdt_mon_domain {
struct rdt_domain_hdr hdr;
struct cacheinfo *ci;
unsigned long *rmid_busy_llc;
- struct mbm_state *mbm_total;
- struct mbm_state *mbm_local;
+ struct mbm_state *mbm_states[QOS_NUM_L3_MBM_EVENTS];
struct delayed_work mbm_over;
struct delayed_work cqm_limbo;
int mbm_work_cpu;
@@ -376,6 +374,15 @@ bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
+static inline bool resctrl_is_mbm_event(enum resctrl_event_id e)
+{
+ return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
+ e <= QOS_L3_MBM_LOCAL_EVENT_ID);
+}
+
+#define for_each_mbm_event(evt) \
+ for (evt = QOS_L3_MBM_TOTAL_EVENT_ID; evt <= QOS_L3_MBM_LOCAL_EVENT_ID; evt++)
+
/**
* resctrl_arch_mon_event_config_write() - Write the config for an event.
* @config_info: struct resctrl_mon_config_info describing the resource, domain
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index a25fb9c4070d..b468bfbab9ea 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -47,4 +47,7 @@ enum resctrl_event_id {
QOS_NUM_EVENTS,
};
+#define QOS_NUM_L3_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
+#define MBM_STATE_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
+
#endif /* __LINUX_RESCTRL_TYPES_H */
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5e3c41b36437..ea185b4d0d59 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -54,15 +54,13 @@ struct rdt_hw_ctrl_domain {
* struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
* a resource for a monitor function
* @d_resctrl: Properties exposed to the resctrl file system
- * @arch_mbm_total: arch private state for MBM total bandwidth
- * @arch_mbm_local: arch private state for MBM local bandwidth
+ * @arch_mbm_states: arch private state for each MBM event
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
struct rdt_hw_mon_domain {
struct rdt_mon_domain d_resctrl;
- struct arch_mbm_state *arch_mbm_total;
- struct arch_mbm_state *arch_mbm_local;
+ struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
};
static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 819bc7a09327..4403a820db12 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -364,8 +364,8 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
- kfree(hw_dom->arch_mbm_total);
- kfree(hw_dom->arch_mbm_local);
+ for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
+ kfree(hw_dom->arch_mbm_states[i]);
kfree(hw_dom);
}
@@ -399,25 +399,27 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
*/
static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
- size_t tsize;
-
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
- tsize = sizeof(*hw_dom->arch_mbm_total);
- hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
- if (!hw_dom->arch_mbm_total)
- return -ENOMEM;
- }
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
- tsize = sizeof(*hw_dom->arch_mbm_local);
- hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
- if (!hw_dom->arch_mbm_local) {
- kfree(hw_dom->arch_mbm_total);
- hw_dom->arch_mbm_total = NULL;
- return -ENOMEM;
- }
+ size_t tsize = sizeof(struct arch_mbm_state);
+ enum resctrl_event_id evt;
+ int idx;
+
+ for_each_mbm_event(evt) {
+ if (!resctrl_is_mon_event_enabled(evt))
+ continue;
+ idx = MBM_STATE_IDX(evt);
+ hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
+ if (!hw_dom->arch_mbm_states[idx])
+ goto cleanup;
}
return 0;
+cleanup:
+ while (--idx >= 0) {
+ kfree(hw_dom->arch_mbm_states[idx]);
+ hw_dom->arch_mbm_states[idx] = NULL;
+ }
+
+ return -ENOMEM;
}
static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index fda579251dba..85526e5540f2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,18 +160,14 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
u32 rmid,
enum resctrl_event_id eventid)
{
- switch (eventid) {
- case QOS_L3_OCCUP_EVENT_ID:
- return NULL;
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &hw_dom->arch_mbm_total[rmid];
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &hw_dom->arch_mbm_local[rmid];
- default:
- /* Never expect to get here */
- WARN_ON_ONCE(1);
+ struct arch_mbm_state *state;
+
+ if (!resctrl_is_mbm_event(eventid))
return NULL;
- }
+
+ state = hw_dom->arch_mbm_states[MBM_STATE_IDX(eventid)];
+
+ return state ? &state[rmid] : NULL;
}
void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
@@ -200,14 +196,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
{
struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
-
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- memset(hw_dom->arch_mbm_total, 0,
- sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
-
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- memset(hw_dom->arch_mbm_local, 0,
- sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
+ enum resctrl_event_id evt;
+ int idx;
+
+ for_each_mbm_event(evt) {
+ idx = MBM_STATE_IDX(evt);
+ if (!hw_dom->arch_mbm_states[idx])
+ continue;
+ memset(hw_dom->arch_mbm_states[idx], 0,
+ sizeof(struct arch_mbm_state) * r->num_rmid);
+ }
}
static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 325e23c1a403..4cd0789998bf 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -346,15 +346,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+ struct mbm_state *states;
- switch (evtid) {
- case QOS_L3_MBM_TOTAL_EVENT_ID:
- return &d->mbm_total[idx];
- case QOS_L3_MBM_LOCAL_EVENT_ID:
- return &d->mbm_local[idx];
- default:
+ if (!resctrl_is_mbm_event(evtid))
return NULL;
- }
+
+ states = d->mbm_states[MBM_STATE_IDX(evtid)];
+
+ return states ? &states[idx] : NULL;
}
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 80e74940281a..8649b89d7bfd 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -127,12 +127,6 @@ static bool resctrl_is_mbm_enabled(void)
resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID));
}
-static bool resctrl_is_mbm_event(int e)
-{
- return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
- e <= QOS_L3_MBM_LOCAL_EVENT_ID);
-}
-
/*
* Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap
* of free CLOSIDs.
@@ -4020,8 +4014,10 @@ static void rdtgroup_setup_default(void)
static void domain_destroy_mon_state(struct rdt_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
- kfree(d->mbm_total);
- kfree(d->mbm_local);
+ for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++) {
+ kfree(d->mbm_states[i]);
+ d->mbm_states[i] = NULL;
+ }
}
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
@@ -4081,32 +4077,34 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
- size_t tsize;
+ size_t tsize = sizeof(struct mbm_state);
+ enum resctrl_event_id evt;
+ int idx;
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
}
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
- tsize = sizeof(*d->mbm_total);
- d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
- if (!d->mbm_total) {
- bitmap_free(d->rmid_busy_llc);
- return -ENOMEM;
- }
- }
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
- tsize = sizeof(*d->mbm_local);
- d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
- if (!d->mbm_local) {
- bitmap_free(d->rmid_busy_llc);
- kfree(d->mbm_total);
- return -ENOMEM;
- }
+
+ for_each_mbm_event(evt) {
+ if (!resctrl_is_mon_event_enabled(evt))
+ continue;
+ idx = MBM_STATE_IDX(evt);
+ d->mbm_states[idx] = kcalloc(idx_limit, tsize, GFP_KERNEL);
+ if (!d->mbm_states[idx])
+ goto cleanup;
}
return 0;
+cleanup:
+ bitmap_free(d->rmid_busy_llc);
+ while (--idx >= 0) {
+ kfree(d->mbm_states[idx]);
+ d->mbm_states[idx] = NULL;
+ }
+
+ return -ENOMEM;
}
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
--
2.49.0
* [PATCH v5 05/29] x86/rectrl: Fake OOBMSM interface
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The real version is coming soon[1] ... this is here so the remaining parts
will build (and run ... assuming a 2-socket system that supports RDT
monitoring ... the only missing part is that the event counters just
report fixed values).
Faked values are provided to exercise some special conditions:
1) Multiple counter aggregators for an event per-socket.
2) Different number of supported RMIDs for each group.
Just for ease of testing and RFC discussion.
[1]
Link: https://lore.kernel.org/all/20250430212106.369208-1-david.e.box@linux.intel.com/
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
.../cpu/resctrl/fake_intel_aet_features.h | 73 ++++++++++++++
.../cpu/resctrl/fake_intel_aet_features.c | 97 +++++++++++++++++++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
3 files changed, 171 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
new file mode 100644
index 000000000000..c835c4108abc
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Bits stolen from OOBMSM VSEC discovery code */
+
+enum pmt_feature_id {
+ FEATURE_INVALID = 0x0,
+ FEATURE_PER_CORE_PERF_TELEM = 0x1,
+ FEATURE_PER_CORE_ENV_TELEM = 0x2,
+ FEATURE_PER_RMID_PERF_TELEM = 0x3,
+ FEATURE_ACCEL_TELEM = 0x4,
+ FEATURE_UNCORE_TELEM = 0x5,
+ FEATURE_CRASH_LOG = 0x6,
+ FEATURE_PETE_LOG = 0x7,
+ FEATURE_TPMI_CTRL = 0x8,
+ FEATURE_RESERVED = 0x9,
+ FEATURE_TRACING = 0xA,
+ FEATURE_PER_RMID_ENERGY_TELEM = 0xB,
+ FEATURE_MAX = 0xB,
+};
+
+/**
+ * struct oobmsm_plat_info - Platform information for a device instance
+ * @cdie_mask: Mask of all compute dies in the partition
+ * @package_id: CPU Package id
+ * @partition: Package partition id when multiple VSEC PCI devices per package
+ * @segment: PCI segment ID
+ * @bus_number: PCI bus number
+ * @device_number: PCI device number
+ * @function_number: PCI function number
+ *
+ * Structure to store platform data for a OOBMSM device instance.
+ */
+struct oobmsm_plat_info {
+ u16 cdie_mask;
+ u8 package_id;
+ u8 partition;
+ u8 segment;
+ u8 bus_number;
+ u8 device_number;
+ u8 function_number;
+};
+
+enum oobmsm_supplier_type {
+ OOBMSM_SUP_PLAT_INFO,
+ OOBMSM_SUP_DISC_INFO,
+ OOBMSM_SUP_S3M_SIMICS,
+ OOBMSM_SUP_TYPE_MAX
+};
+
+struct oobmsm_mapping_supplier {
+ struct device *supplier_dev[OOBMSM_SUP_TYPE_MAX];
+ struct oobmsm_plat_info plat_info;
+ unsigned long features;
+};
+
+struct telemetry_region {
+ struct oobmsm_plat_info plat_info;
+ void __iomem *addr;
+ size_t size;
+ u32 guid;
+ u32 num_rmids;
+};
+
+struct pmt_feature_group {
+ enum pmt_feature_id id;
+ int count;
+ struct kref kref;
+ struct telemetry_region regions[];
+};
+
+struct pmt_feature_group *intel_pmt_get_regions_by_feature(enum pmt_feature_id id);
+
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group);
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
new file mode 100644
index 000000000000..80f38f1ee3df
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/cleanup.h>
+#include <linux/minmax.h>
+#include <linux/slab.h>
+#include "fake_intel_aet_features.h"
+#include <linux/intel_vsec.h>
+#include <linux/resctrl.h>
+
+#include "internal.h"
+
+/*
+ * Amount of memory for each fake MMIO space
+ * Magic numbers here match values for XML ID 0x26696143 and 0x26557651
+ * 576: Number of RMIDs
+ * 2: Energy events in 0x26557651
+ * 7: Perf events in 0x26696143
+ * 3: Qwords for status counters after the event counters
+ * 8: Bytes for each counter
+ */
+
+#define ENERGY_QWORDS ((576 * 2) + 3)
+#define ENERGY_SIZE (ENERGY_QWORDS * 8)
+#define PERF_QWORDS ((576 * 7) + 3)
+#define PERF_SIZE (PERF_QWORDS * 8)
+
+static long pg[4 * ENERGY_QWORDS + 2 * PERF_QWORDS];
+
+/*
+ * Fill the fake MMIO space with all different values,
+ * all with BIT(63) set to indicate valid entries.
+ */
+static int __init fill(void)
+{
+ u64 val = 0;
+
+ for (int i = 0; i < sizeof(pg); i += sizeof(val)) {
+ pg[i / sizeof(val)] = BIT_ULL(63) + val;
+ val++;
+ }
+ return 0;
+}
+device_initcall(fill);
+
+#define PKG_REGION(_entry, _guid, _addr, _size, _pkg, _num_rmids) \
+ [_entry] = { .guid = _guid, .addr = (void __iomem *)_addr, \
+ .num_rmids = _num_rmids, \
+ .size = _size, .plat_info = { .package_id = _pkg }}
+
+/*
+ * Set up a fake return for call to:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
+ * Pretend there are two aggregators on each of the sockets to test
+ * the code that sums over multiple aggregators.
+ * Pretend this group only supports 64 RMIDs to exercise the code
+ * that reconciles support for different RMID counts.
+ */
+static struct pmt_feature_group fake_energy = {
+ .count = 4,
+ .regions = {
+ PKG_REGION(0, 0x26696143, &pg[0 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
+ PKG_REGION(1, 0x26696143, &pg[1 * ENERGY_QWORDS], ENERGY_SIZE, 0, 64),
+ PKG_REGION(2, 0x26696143, &pg[2 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64),
+ PKG_REGION(3, 0x26696143, &pg[3 * ENERGY_QWORDS], ENERGY_SIZE, 1, 64)
+ }
+};
+
+/*
+ * Fake return for:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
+ */
+static struct pmt_feature_group fake_perf = {
+ .count = 2,
+ .regions = {
+ PKG_REGION(0, 0x26557651, &pg[4 * ENERGY_QWORDS + 0 * PERF_QWORDS], PERF_SIZE, 0, 576),
+ PKG_REGION(1, 0x26557651, &pg[4 * ENERGY_QWORDS + 1 * PERF_QWORDS], PERF_SIZE, 1, 576)
+ }
+};
+
+struct pmt_feature_group *
+intel_pmt_get_regions_by_feature(enum pmt_feature_id id)
+{
+ switch (id) {
+ case FEATURE_PER_RMID_ENERGY_TELEM:
+ return &fake_energy;
+ case FEATURE_PER_RMID_PERF_TELEM:
+ return &fake_perf;
+ default:
+ return ERR_PTR(-ENOENT);
+ }
+}
+
+/*
+ * Nothing needed for the "put" function.
+ */
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group)
+{
+}
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index d8a04b195da2..cf4fac58d068 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_X86_CPU_RESCTRL) += fake_intel_aet_features.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
# To allow define_trace.h's recursive include:
--
2.49.0
* [PATCH v5 06/29] x86,fs/resctrl: Improve domain type checking
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The rdt_domain_hdr structure is used in both control and monitor
domain structures to provide common methods for operations such as
adding a CPU to a domain, removing a CPU from a domain, and accessing
the mask of all CPUs in a domain.
The "type" field provides a simple check whether a domain is a
control or monitor domain so that programming errors operating
on domains will be quickly caught.
To prepare for additional domain types that depend on the rdt_resource
to which they are connected, add the resource id to the header
and check it in addition to the type.
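For illustration only (not part of this patch), the lookup pattern this
enables looks roughly like:

	hdr = resctrl_find_domain(&r->mon_domains, id, NULL);
	if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
		return -ENOENT;	/* wrong type or wrong resource */
	d = container_of(hdr, struct rdt_mon_domain, hdr);

i.e. a stale or mismatched header is caught before the result of
container_of() is trusted.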
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 9 +++++++++
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++----
fs/resctrl/ctrlmondata.c | 2 +-
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 40f2d0d48d02..d6b09952ef92 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -131,15 +131,24 @@ enum resctrl_domain_type {
* @list: all instances of this resource
* @id: unique id for this instance
* @type: type of this instance
+ * @rid: index of resource for this domain
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
enum resctrl_domain_type type;
+ enum resctrl_res_level rid;
struct cpumask cpu_mask;
};
+static inline bool domain_header_is_valid(struct rdt_domain_hdr *hdr,
+ enum resctrl_domain_type type,
+ enum resctrl_res_level rid)
+{
+ return !WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
+}
+
/**
* struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
* @hdr: common header for different domain types
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4403a820db12..4983f6f81218 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -456,7 +456,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -473,6 +473,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_CTRL_DOMAIN;
+ d->hdr.rid = r->rid;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
rdt_domain_reconfigure_cdp(r);
@@ -511,7 +512,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
@@ -526,6 +527,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = r->rid;
d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!d->ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -581,7 +583,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -627,7 +629,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 6be423c5e2e0..7d16f7eb6985 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -638,7 +638,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
* the resource to find the domain with "domid".
*/
hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+ if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid)) {
ret = -ENOENT;
goto out;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 07/29] x86,fs/resctrl: Rename some L3 specific functions
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (5 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 06/29] x86,fs/resctrl: Improve domain type checking Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:32 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 08/29] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
` (23 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
All monitor events used to be connected to the L3 resource, so
it was OK for function names to be generic. But generic names will
cause confusion once additional events are tied to other resources.
Rename functions that are only used for L3 features:
arch_mon_domain_online() -> arch_l3_mon_domain_online()
mon_domain_free() -> l3_mon_domain_free()
domain_setup_mon_state() -> domain_setup_l3_mon_state()
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++------
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
fs/resctrl/rdtgroup.c | 6 +++---
4 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index ea185b4d0d59..038c888dcdcf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -122,7 +122,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
extern struct rdt_hw_resource rdt_resources_all[];
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4983f6f81218..c721d1712e97 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -362,7 +362,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
+static void l3_mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
{
for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
kfree(hw_dom->arch_mbm_states[i]);
@@ -531,15 +531,15 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!d->ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
return;
}
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- arch_mon_domain_online(r, d);
+ arch_l3_mon_domain_online(r, d);
if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
return;
}
@@ -549,7 +549,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
}
}
@@ -640,7 +640,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
resctrl_offline_mon_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 85526e5540f2..659265330783 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -264,7 +264,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
* must adjust RMID counter numbers based on SNC node. See
* logical_rmid_to_physical_rmid() for code that does this.
*/
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
{
if (snc_nodes_per_l3_cache > 1)
msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 8649b89d7bfd..8aa9a7e68a59 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4062,7 +4062,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
}
/**
- * domain_setup_mon_state() - Initialise domain monitoring structures.
+ * domain_setup_l3_mon_state() - Initialise domain monitoring structures.
* @r: The resource for the newly online domain.
* @d: The newly online domain.
*
@@ -4074,7 +4074,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
*
* Returns 0 for success, or -ENOMEM.
*/
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
+static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(struct mbm_state);
@@ -4129,7 +4129,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
mutex_lock(&rdtgroup_mutex);
- err = domain_setup_mon_state(r, d);
+ err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 08/29] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (6 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 07/29] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-21 22:50 ` [PATCH v5 09/29] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
` (22 subsequent siblings)
30 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
To prepare for additional types of monitoring domains, move all the L3
resource monitoring domain initialization out of domain_add_cpu_mon()
and into a new helper function l3_mon_domain_setup() (name chosen
as the partner of existing l3_mon_domain_free()).
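As an illustrative sketch only (the non-L3 resource name and helper
below are hypothetical, not part of this series as posted), the new
dispatch could later grow like:

	switch (r->rid) {
	case RDT_RESOURCE_L3:
		l3_mon_domain_setup(cpu, id, r, add_pos);
		break;
	case RDT_RESOURCE_PERF_PKG:	/* hypothetical future resource */
		pkg_mon_domain_setup(cpu, id, r, add_pos);	/* hypothetical helper */
		break;
	default:
		WARN_ON_ONCE(1);
	}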
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 55 ++++++++++++++++++------------
1 file changed, 33 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c721d1712e97..990a0c1af634 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -493,33 +493,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
}
}
-static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct list_head *add_pos = NULL;
struct rdt_hw_mon_domain *hw_dom;
- struct rdt_domain_hdr *hdr;
struct rdt_mon_domain *d;
int err;
- lockdep_assert_held(&domain_list_lock);
-
- if (id < 0) {
- pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->mon_scope, r->name);
- return;
- }
-
- hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
- if (hdr) {
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
-
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- return;
- }
-
hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
if (!hw_dom)
return;
@@ -553,6 +532,38 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
}
}
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_domain_hdr *hdr;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return;
+ cpumask_set_cpu(cpu, &hdr->cpu_mask);
+
+ return;
+ }
+
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ l3_mon_domain_setup(cpu, id, r, add_pos);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ }
+}
+
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
if (r->alloc_capable)
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 09/29] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (7 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 08/29] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:32 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr Tony Luck
` (21 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Historically all monitoring events have been associated with the L3
resource. This will change when support for telemetry events is added.
The RDT_RESOURCE_L3 resource carries a lot of state in the domain
structures which needs to be dealt with when a domain is taken offline
by removing the last CPU in the domain.
Refactor domain_remove_cpu_mon() so that all the L3 processing is
separated from the general actions of clearing the CPU bit in the
domain mask and removing directories from mon_data.
resctrl_offline_mon_domain() still needs to remove domain-specific
directories and files from the "mon_data" directories, but can skip the
L3-specific cleanup when called for other resource types.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++------
fs/resctrl/rdtgroup.c | 5 ++++-
2 files changed, 15 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 990a0c1af634..e4125161ffbd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -643,17 +643,22 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- hw_dom = resctrl_to_arch_mon_dom(d);
+ cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
+ return;
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, d);
list_del_rcu(&d->hdr.list);
synchronize_rcu();
l3_mon_domain_free(hw_dom);
-
- return;
+ break;
+ default:
+ pr_warn_once("Unknown resource rid=%d\n", r->rid);
+ break;
}
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 8aa9a7e68a59..828c743ec470 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4041,6 +4041,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
if (resctrl_mounted && resctrl_arch_mon_capable())
rmdir_mondata_subdir_allrdtgrp(r, d);
+ if (r->rid != RDT_RESOURCE_L3)
+ goto done;
+
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4057,7 +4060,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
}
domain_destroy_mon_state(d);
-
+done:
mutex_unlock(&rdtgroup_mutex);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (8 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 09/29] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-22 0:01 ` Keshavamurthy, Anil S
` (2 more replies)
2025-05-21 22:50 ` [PATCH v5 11/29] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
` (20 subsequent siblings)
30 siblings, 3 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Historically all monitoring events have been associated with the L3
resource and it made sense to use "struct rdt_mon_domain *" arguments
to functions manipulating domains. But the addition of monitor events
tied to other resources changes this assumption.
Some functionality like:
*) adding a CPU to an existing domain
*) removing a CPU that is not the last one from a domain
can be achieved with just access to the rdt_domain_hdr structure.
Change arguments from "rdt_*_domain" to rdt_domain_hdr so functions
can be used on domains from any resource.
Where container_of() is used to find the surrounding domain structure,
add sanity checks that hdr has the expected type.
Simplify code that uses "d->hdr." to just "hdr->" where possible.
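Roughly (a sketch, not the exact call sites in the diff below), the
resulting pattern is: do generic work directly on hdr, and only
downcast with a checked container_of() where L3 specific state is
needed:

	cpumask_clear_cpu(cpu, &hdr->cpu_mask);	/* generic: any domain type */

	if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
		return;
	d = container_of(hdr, struct rdt_mon_domain, hdr);	/* L3 specific state below */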
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 4 +-
arch/x86/kernel/cpu/resctrl/core.c | 39 +++++++-------
fs/resctrl/rdtgroup.c | 83 +++++++++++++++++++++---------
3 files changed, 79 insertions(+), 47 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d6b09952ef92..c02a4d59f3eb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -444,9 +444,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e4125161ffbd..71b884f25475 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -458,9 +458,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
if (hdr) {
if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
- d = container_of(hdr, struct rdt_ctrl_domain, hdr);
-
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ cpumask_set_cpu(cpu, &hdr->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
return;
@@ -524,7 +522,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
list_add_tail_rcu(&d->hdr.list, add_pos);
- err = resctrl_online_mon_domain(r, d);
+ err = resctrl_online_mon_domain(r, &d->hdr);
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
@@ -597,25 +595,24 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
+	cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
+ return;
+
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
hw_dom = resctrl_to_arch_ctrl_dom(d);
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_ctrl_domain(r, d);
- list_del_rcu(&d->hdr.list);
- synchronize_rcu();
-
- /*
- * rdt_ctrl_domain "d" is going to be freed below, so clear
- * its pointer from pseudo_lock_region struct.
- */
- if (d->plr)
- d->plr->d = NULL;
- ctrl_domain_free(hw_dom);
+ resctrl_offline_ctrl_domain(r, d);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
- return;
- }
+ /*
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
+ * its pointer from pseudo_lock_region struct.
+ */
+ if (d->plr)
+ d->plr->d = NULL;
+ ctrl_domain_free(hw_dom);
}
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
@@ -651,8 +648,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
case RDT_RESOURCE_L3:
d = container_of(hdr, struct rdt_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
- resctrl_offline_mon_domain(r, d);
- list_del_rcu(&d->hdr.list);
+ resctrl_offline_mon_domain(r, hdr);
+ list_del_rcu(&hdr->list);
synchronize_rcu();
l3_mon_domain_free(hw_dom);
break;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 828c743ec470..0213fb3a1113 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3022,7 +3022,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
* when last domain being summed is removed.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
char subname[32];
@@ -3030,9 +3030,17 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
char name[32];
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
- if (snc_mode)
- sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ if (snc_mode) {
+ struct rdt_mon_domain *d;
+
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
+ } else {
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ }
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
@@ -3042,11 +3050,12 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
-static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp,
bool do_sum)
{
struct rmid_read rr = {0};
+ struct rdt_mon_domain *d;
struct mon_data *priv;
struct mon_evt *mevt;
int ret, domid;
@@ -3054,7 +3063,14 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
if (mevt->rid != r->rid || !mevt->enabled)
continue;
- domid = do_sum ? d->ci->id : d->hdr.id;
+ if (r->rid == RDT_RESOURCE_L3) {
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return -EINVAL;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ domid = do_sum ? d->ci->id : d->hdr.id;
+ } else {
+ domid = hdr->id;
+ }
priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
if (WARN_ON_ONCE(!priv))
return -EINVAL;
@@ -3063,18 +3079,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
if (ret)
return ret;
- if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
+ if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
+ mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt->evtid, true);
}
return 0;
}
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_mon_domain *d,
+ struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
+ struct rdt_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
@@ -3082,7 +3099,14 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
lockdep_assert_held(&rdtgroup_mutex);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
+ if (snc_mode) {
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ return -EINVAL;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+ } else {
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ }
kn = kernfs_find_and_get(parent_kn, name);
if (kn) {
/*
@@ -3098,13 +3122,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
ret = rdtgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
- ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
+ ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
if (ret)
goto out_destroy;
}
if (snc_mode) {
- sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
if (IS_ERR(ckn)) {
ret = -EINVAL;
@@ -3115,7 +3139,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (ret)
goto out_destroy;
- ret = mon_add_all_files(ckn, d, r, prgrp, false);
+ ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
if (ret)
goto out_destroy;
}
@@ -3133,7 +3157,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3141,12 +3165,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
parent_kn = crgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, crgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
}
}
}
@@ -3155,14 +3179,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_mon_domain *dom;
+ struct rdt_domain_hdr *hdr;
int ret;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(dom, &r->mon_domains, hdr.list) {
- ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
+ list_for_each_entry(hdr, &r->mon_domains, list) {
+ ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
if (ret)
return ret;
}
@@ -4030,8 +4054,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
mutex_unlock(&rdtgroup_mutex);
}
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
+ struct rdt_mon_domain *d;
+
mutex_lock(&rdtgroup_mutex);
/*
@@ -4039,11 +4065,15 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d);
+ rmdir_mondata_subdir_allrdtgrp(r, hdr);
if (r->rid != RDT_RESOURCE_L3)
goto done;
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+		goto done;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4126,12 +4156,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
return err;
}
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- int err;
+ struct rdt_mon_domain *d;
+ int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ goto out_unlock;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4152,7 +4187,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
* If resctrl is mounted, add per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- mkdir_mondata_subdir_allrdtgrp(r, d);
+ mkdir_mondata_subdir_allrdtgrp(r, hdr);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 11/29] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (9 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:40 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 12/29] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
` (19 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Historically all monitoring events have been associated with the L3
resource. This will change when support for telemetry events is added.
The structures to track monitor domains at both the file system and
architecture level have generic names. This may cause confusion when
support for monitoring events in other resources is added.
Rename by adding "l3_" into the names:
rdt_mon_domain -> rdt_l3_mon_domain
rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 18 ++++++------
arch/x86/kernel/cpu/resctrl/internal.h | 14 ++++-----
fs/resctrl/internal.h | 28 +++++++++---------
arch/x86/kernel/cpu/resctrl/core.c | 14 ++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 18 ++++++------
fs/resctrl/ctrlmondata.c | 6 ++--
fs/resctrl/monitor.c | 28 +++++++++---------
fs/resctrl/rdtgroup.c | 40 +++++++++++++-------------
8 files changed, 83 insertions(+), 83 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c02a4d59f3eb..b7a4c7bf4feb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -166,7 +166,7 @@ struct rdt_ctrl_domain {
};
/**
- * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
+ * struct rdt_l3_mon_domain - group of CPUs sharing the resctrl L3 monitor resource
* @hdr: common header for different domain types
* @ci: cache info for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
@@ -176,7 +176,7 @@ struct rdt_ctrl_domain {
* @mbm_work_cpu: worker CPU for MBM h/w counters
* @cqm_work_cpu: worker CPU for CQM h/w counters
*/
-struct rdt_mon_domain {
+struct rdt_l3_mon_domain {
struct rdt_domain_hdr hdr;
struct cacheinfo *ci;
unsigned long *rmid_busy_llc;
@@ -332,10 +332,10 @@ struct resctrl_cpu_defaults {
};
struct resctrl_mon_config_info {
- struct rdt_resource *r;
- struct rdt_mon_domain *d;
- u32 evtid;
- u32 mon_config;
+ struct rdt_resource *r;
+ struct rdt_l3_mon_domain *d;
+ u32 evtid;
+ u32 mon_config;
};
/**
@@ -475,7 +475,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *arch_mon_ctx);
@@ -522,7 +522,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid,
enum resctrl_event_id eventid);
@@ -535,7 +535,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/**
* resctrl_arch_reset_all_ctrls() - Reset the control for each CLOSID to its
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 038c888dcdcf..02c9e7d163dc 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -51,15 +51,15 @@ struct rdt_hw_ctrl_domain {
};
/**
- * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
- * a resource for a monitor function
+ * struct rdt_hw_l3_mon_domain - Arch private attributes of a set of CPUs that share
+ * the L3 resource for a monitor function
* @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_states: arch private state for each MBM event
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_mon_domain {
- struct rdt_mon_domain d_resctrl;
+struct rdt_hw_l3_mon_domain {
+ struct rdt_l3_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
};
@@ -68,9 +68,9 @@ static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctr
return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
}
-static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3_mon_domain *r)
{
- return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
}
/**
@@ -122,7 +122,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
extern struct rdt_hw_resource rdt_resources_all[];
-void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 94e635656261..8659ee33b76f 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -111,15 +111,15 @@ struct mon_data {
* @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only).
*/
struct rmid_read {
- struct rdtgroup *rgrp;
- struct rdt_resource *r;
- struct rdt_mon_domain *d;
- enum resctrl_event_id evtid;
- bool first;
- struct cacheinfo *ci;
- int err;
- u64 val;
- void *arch_mon_ctx;
+ struct rdtgroup *rgrp;
+ struct rdt_resource *r;
+ struct rdt_l3_mon_domain *d;
+ enum resctrl_event_id evtid;
+ bool first;
+ struct cacheinfo *ci;
+ int err;
+ u64 val;
+ void *arch_mon_ctx;
};
extern struct list_head resctrl_schema_all;
@@ -349,12 +349,12 @@ void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first);
int resctrl_mon_resource_init(void);
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
@@ -362,14 +362,14 @@ void mbm_handle_overflow(struct work_struct *work);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_mon_domain *d);
+bool has_busy_rmid(struct rdt_l3_mon_domain *d);
-void __check_limbo(struct rdt_mon_domain *d, bool force_free);
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free);
void resctrl_file_fflags_init(const char *config, unsigned long fflags);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 71b884f25475..b39537658618 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -362,7 +362,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void l3_mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
+static void l3_mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
{
for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
kfree(hw_dom->arch_mbm_states[i]);
@@ -397,7 +397,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
{
size_t tsize = sizeof(struct arch_mbm_state);
enum resctrl_event_id evt;
@@ -493,8 +493,8 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- struct rdt_hw_mon_domain *hw_dom;
- struct rdt_mon_domain *d;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
int err;
hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
@@ -618,9 +618,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_mon_domain *hw_dom;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
- struct rdt_mon_domain *d;
lockdep_assert_held(&domain_list_lock);
@@ -646,7 +646,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
switch (r->rid) {
case RDT_RESOURCE_L3:
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 659265330783..1f6dc253112f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -108,7 +108,7 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
*
* In RMID sharing mode there are fewer "logical RMID" values available
* to accumulate data ("physical RMIDs" are divided evenly between SNC
- * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
+ * nodes that share an L3 cache). Linux creates an rdt_l3_mon_domain for
* each SNC node.
*
* The value loaded into IA32_PQR_ASSOC is the "logical RMID".
@@ -156,7 +156,7 @@ static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_l3_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -170,11 +170,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
return state ? &state[rmid] : NULL;
}
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid,
enum resctrl_event_id eventid)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
u32 prmid;
@@ -193,9 +193,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
enum resctrl_event_id evt;
int idx;
@@ -216,11 +216,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
@@ -264,7 +264,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
* must adjust RMID counter numbers based on SNC node. See
* logical_rmid_to_physical_rmid() for code that does this.
*/
-void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+void arch_l3_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
if (snc_nodes_per_l3_cache > 1)
msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 7d16f7eb6985..6db24f7a3de5 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -547,7 +547,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
}
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first)
{
int cpu;
@@ -590,9 +590,9 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct kernfs_open_file *of = m->private;
enum resctrl_res_level resid;
enum resctrl_event_id evtid;
+ struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
- struct rdt_mon_domain *d;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
struct mon_data *md;
@@ -642,7 +642,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
}
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 4cd0789998bf..c1d248a7fdbc 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -130,7 +130,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_mon_domain *d, bool force_free)
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -188,7 +188,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}
-bool has_busy_rmid(struct rdt_mon_domain *d)
+bool has_busy_rmid(struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -289,7 +289,7 @@ int alloc_rmid(u32 closid)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u32 idx;
lockdep_assert_held(&rdtgroup_mutex);
@@ -342,7 +342,7 @@ void free_rmid(u32 closid, u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}
-static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -359,7 +359,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
int cpu = smp_processor_id();
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct mbm_state *m;
int err, ret;
u64 tval = 0;
@@ -532,7 +532,7 @@ static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu,
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
}
-static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id evtid)
{
struct rmid_read rr = {0};
@@ -627,7 +627,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *
resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
-static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid)
{
/*
@@ -648,12 +648,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
- d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -676,7 +676,7 @@ void cqm_handle_limbo(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -693,7 +693,7 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
@@ -708,7 +708,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- d = container_of(work, struct rdt_mon_domain, mbm_over.work);
+ d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
@@ -742,7 +742,7 @@ void mbm_handle_overflow(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 0213fb3a1113..39e046fba60a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1615,7 +1615,7 @@ static void mondata_config_read(struct resctrl_mon_config_info *mon_info)
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct resctrl_mon_config_info mon_info;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
cpus_read_lock();
@@ -1663,7 +1663,7 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
}
static void mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_mon_domain *d, u32 evtid, u32 val)
+ struct rdt_l3_mon_domain *d, u32 evtid, u32 val)
{
struct resctrl_mon_config_info mon_info = {0};
@@ -1704,8 +1704,8 @@ static void mbm_config_write_domain(struct rdt_resource *r,
static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
+ struct rdt_l3_mon_domain *d;
unsigned long dom_id, val;
- struct rdt_mon_domain *d;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -2579,7 +2579,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
struct rdt_resource *r;
int ret;
@@ -3031,11 +3031,11 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
if (snc_mode) {
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
} else {
@@ -3054,8 +3054,8 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp,
bool do_sum)
{
+ struct rdt_l3_mon_domain *d;
struct rmid_read rr = {0};
- struct rdt_mon_domain *d;
struct mon_data *priv;
struct mon_evt *mevt;
int ret, domid;
@@ -3066,7 +3066,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
if (r->rid == RDT_RESOURCE_L3) {
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
domid = do_sum ? d->ci->id : d->hdr.id;
} else {
domid = hdr->id;
@@ -3091,7 +3091,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
@@ -3102,7 +3102,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (snc_mode) {
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
} else {
sprintf(name, "mon_%s_%02d", r->name, hdr->id);
@@ -4035,7 +4035,7 @@ static void rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}
-static void domain_destroy_mon_state(struct rdt_mon_domain *d)
+static void domain_destroy_mon_state(struct rdt_l3_mon_domain *d)
{
bitmap_free(d->rmid_busy_llc);
for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++) {
@@ -4056,7 +4056,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
mutex_lock(&rdtgroup_mutex);
@@ -4073,7 +4073,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
goto done;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4107,7 +4107,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
*
* Returns 0 for success, or -ENOMEM.
*/
-static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
+static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(struct mbm_state);
@@ -4158,7 +4158,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
@@ -4166,7 +4166,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
goto out_unlock;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4213,10 +4213,10 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
}
}
-static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
- struct rdt_resource *r)
+static struct rdt_l3_mon_domain *get_mon_domain_from_cpu(int cpu,
+ struct rdt_resource *r)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
lockdep_assert_cpus_held();
@@ -4232,7 +4232,7 @@ static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
void resctrl_offline_cpu(unsigned int cpu)
{
struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct rdtgroup *rdtgrp;
mutex_lock(&rdtgroup_mutex);
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 12/29] fs/resctrl: Make event details accessible to functions when reading events
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (10 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 11/29] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-21 22:50 ` [PATCH v5 13/29] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
` (18 subsequent siblings)
30 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
All details about a monitor event are kept in the mon_evt structure.
Upper levels of code only provide the event id to lower levels.
This will become a problem when new attributes are added to the
mon_evt structure.
Change the mon_data and rmid_read structures to hold a pointer
to the mon_evt structure instead of just taking a copy of the
event id.
No functional change.
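A simplified before/after sketch (the pr_debug() consumer here is
illustrative only, not code added by this patch):

	/* before: only the event id travels down to the reader */
	rr.evtid = md->evtid;

	/* after: the full event description travels down */
	rr.evt = md->evt;
	pr_debug("reading event %d for resource %d\n",
		 rr.evt->evtid, rr.evt->rid);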
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 10 +++++-----
fs/resctrl/ctrlmondata.c | 16 ++++++++--------
fs/resctrl/monitor.c | 17 +++++++++--------
fs/resctrl/rdtgroup.c | 6 +++---
4 files changed, 25 insertions(+), 24 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 8659ee33b76f..085a2ee1922f 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -73,7 +73,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
* @rid: Resource id associated with the event file.
- * @evtid: Event id associated with the event file.
+ * @evt: Event associated with the event file.
* @sum: Set when event must be summed across multiple
* domains.
* @domid: When @sum is zero this is the domain to which
@@ -87,7 +87,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
struct mon_data {
struct list_head list;
enum resctrl_res_level rid;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
int domid;
bool sum;
};
@@ -100,7 +100,7 @@ struct mon_data {
* @r: Resource describing the properties of the event being read.
* @d: Domain that the counter should be read from. If NULL then sum all
* domains in @r sharing L3 @ci.id
- * @evtid: Which monitor event to read.
+ * @evt: Which monitor event to read.
* @first: Initialize MBM counter when true.
* @ci: Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
* @err: Error encountered when reading counter.
@@ -114,7 +114,7 @@ struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
struct rdt_l3_mon_domain *d;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
bool first;
struct cacheinfo *ci;
int err;
@@ -350,7 +350,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first);
+ cpumask_t *cpumask, struct mon_evt *evt, int first);
int resctrl_mon_resource_init(void);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 6db24f7a3de5..dcde27f6f2ec 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -548,7 +548,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first)
+ cpumask_t *cpumask, struct mon_evt *evt, int first)
{
int cpu;
@@ -559,11 +559,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* Setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
- rr->evtid = evtid;
+ rr->evt = evt;
rr->r = r;
rr->d = d;
rr->first = first;
- rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid);
+ rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evt->evtid);
if (IS_ERR(rr->arch_mon_ctx)) {
rr->err = -EINVAL;
return;
@@ -582,20 +582,20 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
- resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
+ resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
enum resctrl_res_level resid;
- enum resctrl_event_id evtid;
struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
struct mon_data *md;
+ struct mon_evt *evt;
int domid, ret = 0;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
@@ -612,7 +612,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
resid = md->rid;
domid = md->domid;
- evtid = md->evtid;
+ evt = md->evt;
r = resctrl_arch_get_resource(resid);
if (md->sum) {
@@ -626,7 +626,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
if (d->ci->id == domid) {
rr.ci = d->ci;
mon_event_read(&rr, r, NULL, rdtgrp,
- &d->ci->shared_cpu_map, evtid, false);
+ &d->ci->shared_cpu_map, evt, false);
goto checkresult;
}
}
@@ -643,7 +643,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
goto out;
}
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
+ mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evt, false);
}
checkresult:
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index c1d248a7fdbc..3cfd1bf1845e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -365,8 +365,8 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
u64 tval = 0;
if (rr->first) {
- resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evt->evtid);
+ m = get_mbm_state(rr->d, closid, rmid, rr->evt->evtid);
if (m)
memset(m, 0, sizeof(struct mbm_state));
return 0;
@@ -377,7 +377,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
return -EINVAL;
rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -402,7 +402,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
if (d->ci->id != rr->ci->id)
continue;
err = resctrl_arch_rmid_read(rr->r, d, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
ret = 0;
@@ -432,7 +432,7 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
u64 cur_bw, bytes, cur_bytes;
struct mbm_state *m;
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ m = get_mbm_state(rr->d, closid, rmid, rr->evt->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -603,12 +603,13 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_m
static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id evtid)
{
+ struct mon_evt *evt = &mon_event_all[evtid];
struct rmid_read rr = {0};
rr.r = r;
rr.d = d;
- rr.evtid = evtid;
- rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ rr.evt = evt;
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, evt->evtid);
if (IS_ERR(rr.arch_mon_ctx)) {
pr_warn_ratelimited("Failed to allocate monitor context: %ld",
PTR_ERR(rr.arch_mon_ctx));
@@ -624,7 +625,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domai
if (is_mba_sc(NULL))
mbm_bw_count(closid, rmid, &rr);
- resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
+ resctrl_arch_mon_ctx_free(rr.r, evt->evtid, rr.arch_mon_ctx);
}
static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 39e046fba60a..67482f1110b3 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2897,7 +2897,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
if (priv->rid == rid && priv->domid == domid &&
- priv->sum == do_sum && priv->evtid == mevt->evtid)
+ priv->sum == do_sum && priv->evt == mevt)
return priv;
}
@@ -2908,7 +2908,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
priv->rid = rid;
priv->domid = domid;
priv->sum = do_sum;
- priv->evtid = mevt->evtid;
+ priv->evt = mevt;
list_add_tail(&priv->list, &mon_data_kn_priv_list);
return priv;
@@ -3080,7 +3080,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
return ret;
if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt->evtid, true);
+ mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt, true);
}
return 0;
--
2.49.0
* [PATCH v5 13/29] x86,fs/resctrl: Handle events that can be read from any CPU
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (11 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 12/29] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:42 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
` (17 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Resctrl file system code was built with the assumption that monitor
events can only be read from a CPU in the cpumask_t set for each
domain.
This was true for x86 events accessed with an MSR interface, but may
not be true for other access methods such as MMIO.
Add a flag to struct mon_evt to indicate which events can be read on
any CPU.
Architecture code uses resctrl_enable_mon_event() to enable an event and
set the flag appropriately.
Bypass all the smp_call*() code for events that can be read on any CPU
and call mon_event_count() directly from mon_event_read().
Skip checks in __mon_event_count() that the read is being done from
a CPU in the correct domain or cache scope.
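A sketch of how architecture code is expected to use the new parameter
for an MMIO-backed event (the event id below is only a placeholder,
real callers arrive in later patches):
	/* counters live in MMIO, so any CPU may read them */
	resctrl_enable_mon_event(SOME_MMIO_EVENT_ID, true);
The existing MSR-based L3 events pass false and keep the current
smp_call*() based behavior.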
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +-
fs/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 +++---
fs/resctrl/ctrlmondata.c | 7 ++++++-
fs/resctrl/monitor.c | 26 ++++++++++++++++++++------
5 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index b7a4c7bf4feb..9aab3d78005a 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -377,7 +377,7 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
-void resctrl_enable_mon_event(enum resctrl_event_id evtid);
+void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 085a2ee1922f..eb6e92d1ab15 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -57,6 +57,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @rid: index of the resource for this event
* @name: name of the event
* @configurable: true if the event is configurable
+ * @any_cpu: true if the event can be read from any CPU
* @enabled: true if the event is enabled
*/
struct mon_evt {
@@ -64,6 +65,7 @@ struct mon_evt {
enum resctrl_res_level rid;
char *name;
bool configurable;
+ bool any_cpu;
bool enabled;
};
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b39537658618..5d9a024ce4b0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -880,15 +880,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
ret = true;
}
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index dcde27f6f2ec..1337716f59c8 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -569,6 +569,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
return;
}
+ if (evt->any_cpu) {
+ mon_event_count(rr);
+ goto done;
+ }
+
cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
/*
@@ -581,7 +586,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
smp_call_function_any(cpumask, mon_event_count, rr, 1);
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
-
+done:
resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 3cfd1bf1845e..e6e3be990638 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -356,9 +356,24 @@ static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
return states ? &states[idx] : NULL;
}
+static bool cpu_on_wrong_domain(struct rmid_read *rr)
+{
+ cpumask_t *mask;
+
+ if (rr->evt->any_cpu)
+ return false;
+
+ /*
+ * When reading from a specific domain the CPU must be in that
+ * domain. Otherwise the CPU must be one that shares the cache.
+ */
+ mask = rr->d ? &rr->d->hdr.cpu_mask : &rr->ci->shared_cpu_map;
+
+ return !cpumask_test_cpu(smp_processor_id(), mask);
+}
+
static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
{
- int cpu = smp_processor_id();
struct rdt_l3_mon_domain *d;
struct mbm_state *m;
int err, ret;
@@ -373,8 +388,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
}
if (rr->d) {
- /* Reading a single domain, must be on a CPU in that domain. */
- if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
+ if (cpu_on_wrong_domain(rr))
return -EINVAL;
rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
rr->evt->evtid, &tval, rr->arch_mon_ctx);
@@ -386,8 +400,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
return 0;
}
- /* Summing domains that share a cache, must be on a CPU for that cache. */
- if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
+ if (cpu_on_wrong_domain(rr))
return -EINVAL;
/*
@@ -865,7 +878,7 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
},
};
-void resctrl_enable_mon_event(enum resctrl_event_id evtid)
+void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
{
if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
return;
@@ -874,6 +887,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid)
return;
}
+ mon_event_all[evtid].any_cpu = any_cpu;
mon_event_all[evtid].enabled = true;
}
--
2.49.0
* [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (12 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 13/29] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:49 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 15/29] fs/resctrl: Add an architectural hook called for each mount Tony Luck
` (16 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Resctrl was written with the assumption that all monitor events
can be displayed as unsigned decimal integers.
Hardware may provide some telemetry events with greater precision where
the event is not a simple count, but a measurement of some sort
(e.g. Joules of energy consumed).
Add a new argument to resctrl_enable_mon_event() for architecture
code to inform the file system that the value for a counter is
a fixed-point value with a specific number of binary places.
Fixed-point values are displayed rounded to an appropriate number of
decimal places.
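As a worked example of the encoding: with 18 binary places the raw
counter value 0x48000 represents 0x48000 / 2^18 = 1.125 and is
displayed as "1.125"; a plain integer event (zero binary places) is
displayed unchanged.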
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 4 +-
fs/resctrl/internal.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 6 +--
fs/resctrl/ctrlmondata.c | 75 +++++++++++++++++++++++++++++-
fs/resctrl/monitor.c | 5 +-
5 files changed, 85 insertions(+), 7 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 9aab3d78005a..46ba62ee94a1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -377,7 +377,9 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
-void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu);
+#define MAX_BINARY_BITS 27
+
+void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu, u32 binary_bits);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index eb6e92d1ab15..d5045491790e 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -58,6 +58,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @name: name of the event
* @configurable: true if the event is configurable
* @any_cpu: true if the event can be read from any CPU
+ * @binary_bits: number of fixed-point binary bits from architecture
* @enabled: true if the event is enabled
*/
struct mon_evt {
@@ -66,6 +67,7 @@ struct mon_evt {
char *name;
bool configurable;
bool any_cpu;
+ int binary_bits;
bool enabled;
};
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 5d9a024ce4b0..306afb50fd37 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -880,15 +880,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
ret = true;
}
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 1337716f59c8..07bf44834a46 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -590,6 +590,77 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
+/**
+ * struct fixed_params - parameters to decode a binary fixed point value
+ * @mask: Mask for fractional part of value.
+ * @lshift: Shift to round-up binary places.
+ * @pow10: Multiplier (10 ^ decimal places).
+ * @round: Add to round up to nearest decimal representation.
+ * @rshift: Shift back for final answer.
+ * @decplaces: Number of decimal places for this number of binary places.
+ */
+struct fixed_params {
+ u64 mask;
+ int lshift;
+ int pow10;
+ u64 round;
+ int rshift;
+ int decplaces;
+};
+
+static struct fixed_params fixed_params[MAX_BINARY_BITS + 1] = {
+ [1] = { GENMASK_ULL(1, 0), 0, 10, 0x00000000, 1, 1 },
+ [2] = { GENMASK_ULL(2, 0), 0, 100, 0x00000000, 2, 2 },
+ [3] = { GENMASK_ULL(3, 0), 0, 1000, 0x00000000, 3, 3 },
+ [4] = { GENMASK_ULL(4, 0), 2, 1000, 0x00000020, 6, 3 },
+ [5] = { GENMASK_ULL(5, 0), 1, 1000, 0x00000020, 6, 3 },
+ [6] = { GENMASK_ULL(6, 0), 0, 1000, 0x00000020, 6, 3 },
+ [7] = { GENMASK_ULL(7, 0), 2, 1000, 0x00000100, 9, 3 },
+ [8] = { GENMASK_ULL(8, 0), 1, 1000, 0x00000100, 9, 3 },
+ [9] = { GENMASK_ULL(9, 0), 0, 1000, 0x00000100, 9, 3 },
+ [10] = { GENMASK_ULL(10, 0), 2, 10000, 0x00000800, 12, 4 },
+ [11] = { GENMASK_ULL(11, 0), 1, 10000, 0x00000800, 12, 4 },
+ [12] = { GENMASK_ULL(12, 0), 0, 10000, 0x00000800, 12, 4 },
+ [13] = { GENMASK_ULL(13, 0), 2, 100000, 0x00004000, 15, 5 },
+ [14] = { GENMASK_ULL(14, 0), 1, 100000, 0x00004000, 15, 5 },
+ [15] = { GENMASK_ULL(15, 0), 0, 100000, 0x00004000, 15, 5 },
+ [16] = { GENMASK_ULL(16, 0), 2, 1000000, 0x00020000, 18, 6 },
+ [17] = { GENMASK_ULL(17, 0), 1, 1000000, 0x00020000, 18, 6 },
+ [18] = { GENMASK_ULL(18, 0), 0, 1000000, 0x00020000, 18, 6 },
+ [19] = { GENMASK_ULL(19, 0), 2, 10000000, 0x00100000, 21, 7 },
+ [20] = { GENMASK_ULL(20, 0), 1, 10000000, 0x00100000, 21, 7 },
+ [21] = { GENMASK_ULL(21, 0), 0, 10000000, 0x00100000, 21, 7 },
+ [22] = { GENMASK_ULL(22, 0), 2, 100000000, 0x00800000, 24, 8 },
+ [23] = { GENMASK_ULL(23, 0), 1, 100000000, 0x00800000, 24, 8 },
+ [24] = { GENMASK_ULL(24, 0), 0, 100000000, 0x00800000, 24, 8 },
+ [25] = { GENMASK_ULL(25, 0), 2, 1000000000, 0x04000000, 27, 9 },
+ [26] = { GENMASK_ULL(26, 0), 1, 1000000000, 0x04000000, 27, 9 },
+ [27] = { GENMASK_ULL(27, 0), 0, 1000000000, 0x04000000, 27, 9 }
+};
+
+static void print_event_value(struct seq_file *m, int binary_bits, u64 val)
+{
+ struct fixed_params *fp = &fixed_params[binary_bits];
+ unsigned long long frac;
+ char buf[10];
+
+ frac = val & fp->mask;
+ frac <<= fp->lshift;
+ frac *= fp->pow10;
+ frac += fp->round;
+ frac >>= fp->rshift;
+
+ sprintf(buf, "%0*llu", fp->decplaces, frac);
+
+ /* Trim trailing zeroes */
+ for (int i = fp->decplaces - 1; i > 0; i--) {
+ if (buf[i] != '0')
+ break;
+ buf[i] = '\0';
+ }
+ seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
+}
+
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
@@ -657,8 +728,10 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
seq_puts(m, "Error\n");
else if (rr.err == -EINVAL)
seq_puts(m, "Unavailable\n");
- else
+ else if (evt->binary_bits == 0)
seq_printf(m, "%llu\n", rr.val);
+ else
+ print_event_value(m, evt->binary_bits, rr.val);
out:
rdtgroup_kn_unlock(of->kn);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index e6e3be990638..f554d7933739 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -878,9 +878,9 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
},
};
-void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
+void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu, u32 binary_bits)
{
- if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
+ if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS) || binary_bits > MAX_BINARY_BITS)
return;
if (mon_event_all[evtid].enabled) {
pr_warn("Duplicate enable for event %d\n", evtid);
@@ -888,6 +888,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
}
mon_event_all[evtid].any_cpu = any_cpu;
+ mon_event_all[evtid].binary_bits = binary_bits;
mon_event_all[evtid].enabled = true;
}
--
2.49.0
* [PATCH v5 15/29] fs/resctrl: Add an architectural hook called for each mount
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (13 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:49 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 16/29] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
` (15 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Enumeration of Intel telemetry events is not complete when the
resctrl "late_init" code is executed.
Add a hook at the beginning of the mount code that will be used
to check for telemetry events and initialize if any are found.
The hook is called on every attempted mount, but most actions (such as
enumeration) are expected to be needed only on the first call.
The call is made with no locks held. Architecture code is responsible
for any required locking.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 6 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 9 +++++++++
fs/resctrl/rdtgroup.c | 2 ++
3 files changed, 17 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 46ba62ee94a1..4ad3d7f10580 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -452,6 +452,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
+/*
+ * Architecture hook called for each attempted file system mount.
+ * No locks are held.
+ */
+void resctrl_arch_pre_mount(void);
+
/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 306afb50fd37..f8c9840ce7dc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -710,6 +710,15 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
return 0;
}
+void resctrl_arch_pre_mount(void)
+{
+ static atomic_t only_once;
+ int old = 0;
+
+ if (!atomic_try_cmpxchg(&only_once, &old, 1))
+ return;
+}
+
enum {
RDT_FLAG_CMT,
RDT_FLAG_MBM_TOTAL,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 67482f1110b3..bdad98ac0d27 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2583,6 +2583,8 @@ static int rdt_get_tree(struct fs_context *fc)
struct rdt_resource *r;
int ret;
+ resctrl_arch_pre_mount();
+
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
/*
--
2.49.0
* [PATCH v5 16/29] x86/resctrl: Add and initialize rdt_resource for package scope core monitor
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (14 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 15/29] fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-21 22:50 ` [PATCH v5 17/29] x86/resctrl: Discover hardware telemetry events Tony Luck
` (14 subsequent siblings)
30 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Counts for each Intel telemetry event are periodically sent to one or
more aggregators on each package where accumulated totals are made
available in MMIO registers.
Add a new resource for monitoring these events so that CPU hotplug
notifiers will build domains at the package granularity.
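For example, on a two-socket system this resource is expected to end up
with two monitoring domains, one per package, each covering the CPUs of
that package.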
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 ++
fs/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++++++
fs/resctrl/rdtgroup.c | 2 ++
4 files changed, 16 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 4ad3d7f10580..4ba51cb598e1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,6 +53,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_PERF_PKG,
/* Must be the last */
RDT_NUM_RESOURCES,
@@ -250,6 +251,7 @@ enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
RESCTRL_L3_NODE,
+ RESCTRL_PACKAGE,
};
/**
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index d5045491790e..64c1c226d676 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -234,6 +234,8 @@ struct rdtgroup {
#define RFTYPE_DEBUG BIT(10)
+#define RFTYPE_RES_PERF_PKG BIT(11)
+
#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f8c9840ce7dc..ce4885c751e4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -99,6 +99,14 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
.schema_fmt = RESCTRL_SCHEMA_RANGE,
},
},
+ [RDT_RESOURCE_PERF_PKG] =
+ {
+ .r_resctrl = {
+ .name = "PERF_PKG",
+ .mon_scope = RESCTRL_PACKAGE,
+ .mon_domains = mon_domain_init(RDT_RESOURCE_PERF_PKG),
+ },
+ },
};
u32 resctrl_arch_system_num_rmid_idx(void)
@@ -430,6 +438,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return get_cpu_cacheinfo_id(cpu, scope);
case RESCTRL_L3_NODE:
return cpu_to_node(cpu);
+ case RESCTRL_PACKAGE:
+ return topology_physical_package_id(cpu);
default:
break;
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index bdad98ac0d27..1e1cc8001cbc 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2193,6 +2193,8 @@ static unsigned long fflags_from_resource(struct rdt_resource *r)
case RDT_RESOURCE_MBA:
case RDT_RESOURCE_SMBA:
return RFTYPE_RES_MB;
+ case RDT_RESOURCE_PERF_PKG:
+ return RFTYPE_RES_PERF_PKG;
}
return WARN_ON_ONCE(1);
--
2.49.0
* [PATCH v5 17/29] x86/resctrl: Discover hardware telemetry events
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (15 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 16/29] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:53 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 18/29] x86/resctrl: Count valid telemetry aggregators per package Tony Luck
` (13 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Hardware has one or more telemetry event aggregators per package
for each group of telemetry events. Each aggregator provides access
to event counts in an array of 64-bit values in MMIO space. There
is a "guid" (in this case a unique 32-bit integer) which refers to
an XML file published in the https://github.com/intel/Intel-PMT
repository that provides all the details about each aggregator.
The XML files provide the following information:
1) Which telemetry events are included in the group for this aggregator.
2) The order in which the event counters appear for each RMID.
3) The value type of each event counter (integer or fixed-point).
4) The number of RMIDs supported.
5) Which additional aggregator status registers are included.
6) The total size of the MMIO region for this aggregator.
There is an INTEL_PMT_DISCOVERY driver that enumerates all aggregators
on the system with intel_pmt_get_regions_by_feature(). Call this for
each pmt_feature_id that indicates per-RMID telemetry.
Save the returned pmt_feature_group pointers with guids that are known
to resctrl for use at run time.
Those pointers are returned to the INTEL_PMT_DISCOVERY driver at
resctrl_arch_exit() time.
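A sketch of the discovery interaction (using only calls that appear in
the diff below):
	p = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
	/* p->regions[i].guid identifies the XML description of aggregator i */
	...
	intel_pmt_put_feature_group(p);   /* on error, or at exit time */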
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 3 +
arch/x86/kernel/cpu/resctrl/core.c | 5 +
arch/x86/kernel/cpu/resctrl/intel_aet.c | 129 ++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
4 files changed, 138 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 02c9e7d163dc..2b2d4b5a4643 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -167,4 +167,7 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
+bool intel_aet_get_events(void);
+void __exit intel_aet_exit(void);
+
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ce4885c751e4..64ce561e77a0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -727,6 +727,9 @@ void resctrl_arch_pre_mount(void)
if (!atomic_try_cmpxchg(&only_once, &old, 1))
return;
+
+ if (!intel_aet_get_events())
+ return;
}
enum {
@@ -1079,6 +1082,8 @@ late_initcall(resctrl_arch_late_init);
static void __exit resctrl_arch_exit(void)
{
+ intel_aet_exit();
+
cpuhp_remove_state(rdt_online);
resctrl_exit();
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
new file mode 100644
index 000000000000..df73b9476c4d
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology(RDT)
+ * - Intel Application Energy Telemetry
+ *
+ * Copyright (C) 2025 Intel Corporation
+ *
+ * Author:
+ * Tony Luck <tony.luck@intel.com>
+ */
+
+#define pr_fmt(fmt) "resctrl: " fmt
+
+#include <linux/cleanup.h>
+#include <linux/cpu.h>
+#include <linux/resctrl.h>
+
+/* Temporary - delete from final version */
+#include "fake_intel_aet_features.h"
+
+#include "internal.h"
+
+/**
+ * struct event_group - All information about a group of telemetry events.
+ * @pfg: Points to the aggregated telemetry space information
+ * within the OOBMSM driver that contains data for all
+ * telemetry regions.
+ * @guid: Unique number per XML description file.
+ */
+struct event_group {
+ /* Data fields used by this code. */
+ struct pmt_feature_group *pfg;
+
+ /* Remaining fields initialized from XML file. */
+ u32 guid;
+};
+
+/*
+ * Link: https://github.com/intel/Intel-PMT
+ * File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
+ */
+static struct event_group energy_0x26696143 = {
+ .guid = 0x26696143,
+};
+
+/*
+ * Link: https://github.com/intel/Intel-PMT
+ * File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
+ */
+static struct event_group perf_0x26557651 = {
+ .guid = 0x26557651,
+};
+
+static struct event_group *known_event_groups[] = {
+ &energy_0x26696143,
+ &perf_0x26557651,
+};
+
+#define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
+
+/* Stub for now */
+static int configure_events(struct event_group *e, struct pmt_feature_group *p)
+{
+ return -EINVAL;
+}
+
+DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
+ if (!IS_ERR_OR_NULL(_T)) \
+ intel_pmt_put_feature_group(_T))
+
+/*
+ * Make a request to the INTEL_PMT_DISCOVERY driver for the
+ * pmt_feature_group for a specific feature. If there is
+ * one the returned structure has an array of telemetry_region
+ * structures. Each describes one telemetry aggregator.
+ * Try to configure any with a known matching guid.
+ */
+static bool get_pmt_feature(enum pmt_feature_id feature)
+{
+ struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
+ struct event_group **peg;
+ bool ret;
+
+ p = intel_pmt_get_regions_by_feature(feature);
+
+ if (IS_ERR_OR_NULL(p))
+ return false;
+
+ for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
+ for (int i = 0; i < p->count; i++) {
+ if ((*peg)->guid == p->regions[i].guid) {
+ ret = configure_events(*peg, p);
+ if (!ret) {
+ (*peg)->pfg = no_free_ptr(p);
+ return true;
+ }
+ break;
+ }
+ }
+ }
+
+ return false;
+}
+
+/*
+ * Ask OOBMSM discovery driver for all the RMID based telemetry groups
+ * that it supports.
+ */
+bool intel_aet_get_events(void)
+{
+ bool ret1, ret2;
+
+ ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM);
+ ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM);
+
+ return ret1 || ret2;
+}
+
+void __exit intel_aet_exit(void)
+{
+ struct event_group **peg;
+
+ for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
+ if ((*peg)->pfg) {
+ intel_pmt_put_feature_group((*peg)->pfg);
+ (*peg)->pfg = NULL;
+ }
+ }
+}
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index cf4fac58d068..cca23f06d15d 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_X86_CPU_RESCTRL) += intel_aet.o
obj-$(CONFIG_X86_CPU_RESCTRL) += fake_intel_aet_features.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
--
2.49.0
* [PATCH v5 18/29] x86/resctrl: Count valid telemetry aggregators per package
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (16 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 17/29] x86/resctrl: Discover hardware telemetry events Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:54 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 19/29] x86/resctrl: Complete telemetry event enumeration Tony Luck
` (12 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There may be multiple telemetry aggregators per package, each enumerated
by a telemetry region structure in the feature group.
Scan the array of telemetry region structures and count how many are
in each package, in preparation for allocating structures that save the
MMIO addresses in a convenient format for use when reading event
counters.
Sanity check that the telemetry region structures have a valid
package_id and that the size they report for the MMIO space is as
large as expected from the XML description of the registers in
the region.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 45 ++++++++++++++++++++++++-
1 file changed, 44 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index df73b9476c4d..ffcb54be54ea 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -14,6 +14,7 @@
#include <linux/cleanup.h>
#include <linux/cpu.h>
#include <linux/resctrl.h>
+#include <linux/slab.h>
/* Temporary - delete from final version */
#include "fake_intel_aet_features.h"
@@ -26,6 +27,7 @@
* within the OOBMSM driver that contains data for all
* telemetry regions.
* @guid: Unique number per XML description file.
+ * @mmio_size: Number of bytes of MMIO registers for this group.
*/
struct event_group {
/* Data fields used by this code. */
@@ -33,6 +35,7 @@ struct event_group {
/* Remaining fields initialized from XML file. */
u32 guid;
+ size_t mmio_size;
};
/*
@@ -41,6 +44,7 @@ struct event_group {
*/
static struct event_group energy_0x26696143 = {
.guid = 0x26696143,
+ .mmio_size = (576 * 2 + 3) * 8,
};
/*
@@ -49,6 +53,7 @@ static struct event_group energy_0x26696143 = {
*/
static struct event_group perf_0x26557651 = {
.guid = 0x26557651,
+ .mmio_size = (576 * 7 + 3) * 8,
};
static struct event_group *known_event_groups[] = {
@@ -58,9 +63,47 @@ static struct event_group *known_event_groups[] = {
#define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
-/* Stub for now */
+static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
+{
+ if (tr->guid != e->guid)
+ return true;
+ if (tr->plat_info.package_id >= topology_max_packages()) {
+ pr_warn_once("Bad package %d in guid 0x%x\n", tr->plat_info.package_id,
+ tr->guid);
+ return true;
+ }
+ if (tr->size < e->mmio_size) {
+ pr_warn_once("MMIO space too small for guid 0x%x\n", e->guid);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Configure events from one pmt_feature_group.
+ * 1) Count how many per package.
+ * 2...) To be continued.
+ */
static int configure_events(struct event_group *e, struct pmt_feature_group *p)
{
+ int *pkgcounts __free(kfree) = NULL;
+ struct telemetry_region *tr;
+ int num_pkgs;
+
+ num_pkgs = topology_max_packages();
+ pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
+ if (!pkgcounts)
+ return -ENOMEM;
+
+ /* Get per-package counts of telemetry_regions for this event group */
+ for (int i = 0; i < p->count; i++) {
+ tr = &p->regions[i];
+ if (skip_this_region(tr, e))
+ continue;
+ pkgcounts[tr->plat_info.package_id]++;
+ }
+
return -EINVAL;
}
--
2.49.0
* [PATCH v5 19/29] x86/resctrl: Complete telemetry event enumeration
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (17 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 18/29] x86/resctrl: Count valid telemetry aggregators per package Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 4:05 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
` (11 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Counters for telemetry events are in MMIO space. Each telemetry_region
structure in the pmt_feature_group returned by the OOBMSM driver
contains the base MMIO address for its counters.
Scan all the telemetry_region structures again and gather these
addresses into a more convenient structure with addresses for
each aggregator indexed by package id. Note that there may be
multiple aggregators per package.
The completed structure for each event group looks like this:
             +---------------------+---------------------+
pkginfo** -->|      pkginfo[0]     |      pkginfo[1]     |
             +---------------------+---------------------+
                        |                     |
                        v                     v
                +----------------+    +----------------+
                |struct mmio_info|    |struct mmio_info|
                +----------------+    +----------------+
                | count = N      |    | count = N      |
                | addrs[0]       |    | addrs[0]       |
                | addrs[1]       |    | addrs[1]       |
                |   ...          |    |   ...          |
                | addrs[N-1]     |    | addrs[N-1]     |
                +----------------+    +----------------+
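At run time the MMIO base addresses for event group 'e' on package 'p'
are then reachable as e->pkginfo[p]->addrs[0 .. count - 1] (usage
sketch; the read path itself is added by a later patch).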
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 65 ++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index ffcb54be54ea..2316198eb69e 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -21,17 +21,32 @@
#include "internal.h"
+/**
+ * struct mmio_info - Array of MMIO addresses for one event group for a package
+ * @count: Number of addresses on this package
+ * @addrs: The MMIO addresses
+ *
+ * Provides convenient access to all MMIO addresses of one event group
+ * for one package. Used when reading event data on a package.
+ */
+struct mmio_info {
+ int count;
+ void __iomem *addrs[] __counted_by(count);
+};
+
/**
* struct event_group - All information about a group of telemetry events.
* @pfg: Points to the aggregated telemetry space information
* within the OOBMSM driver that contains data for all
* telemetry regions.
+ * @pkginfo: Per-package MMIO addresses of telemetry regions belonging to this group
* @guid: Unique number per XML description file.
* @mmio_size: Number of bytes of MMIO registers for this group.
*/
struct event_group {
/* Data fields used by this code. */
struct pmt_feature_group *pfg;
+ struct mmio_info **pkginfo;
/* Remaining fields initialized from XML file. */
u32 guid;
@@ -80,6 +95,20 @@ static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
return false;
}
+static void free_mmio_info(struct mmio_info **mmi)
+{
+ int num_pkgs = topology_max_packages();
+
+ if (!mmi)
+ return;
+
+ for (int i = 0; i < num_pkgs; i++)
+ kfree(mmi[i]);
+ kfree(mmi);
+}
+
+DEFINE_FREE(mmio_info, struct mmio_info **, free_mmio_info(_T))
+
/*
* Configure events from one pmt_feature_group.
* 1) Count how many per package.
@@ -87,10 +116,17 @@ static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
*/
static int configure_events(struct event_group *e, struct pmt_feature_group *p)
{
+ struct mmio_info __free(mmio_info) **pkginfo = NULL;
int *pkgcounts __free(kfree) = NULL;
struct telemetry_region *tr;
+ struct mmio_info *mmi;
int num_pkgs;
+ if (e->pkginfo) {
+ pr_warn("Duplicate telemetry information for guid 0x%x\n", e->guid);
+ return -EINVAL;
+ }
+
num_pkgs = topology_max_packages();
pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
if (!pkgcounts)
@@ -104,7 +140,33 @@ static int configure_events(struct event_group *e, struct pmt_feature_group *p)
pkgcounts[tr->plat_info.package_id]++;
}
- return -EINVAL;
+ /* Allocate array for per-package struct mmio_info data */
+ pkginfo = kcalloc(num_pkgs, sizeof(*pkginfo), GFP_KERNEL);
+ if (!pkginfo)
+ return -ENOMEM;
+
+ /*
+ * Allocate per-package mmio_info structures and initialize
+ * count of telemetry_regions in each one.
+ */
+ for (int i = 0; i < num_pkgs; i++) {
+ pkginfo[i] = kzalloc(struct_size(pkginfo[i], addrs, pkgcounts[i]), GFP_KERNEL);
+ if (!pkginfo[i])
+ return -ENOMEM;
+ pkginfo[i]->count = pkgcounts[i];
+ }
+
+ /* Save MMIO address(es) for each telemetry region in per-package structures */
+ for (int i = 0; i < p->count; i++) {
+ tr = &p->regions[i];
+ if (skip_this_region(tr, e))
+ continue;
+ mmi = pkginfo[tr->plat_info.package_id];
+ mmi->addrs[--pkgcounts[tr->plat_info.package_id]] = tr->addr;
+ }
+ e->pkginfo = no_free_ptr(pkginfo);
+
+ return 0;
}
DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
@@ -168,5 +230,6 @@ void __exit intel_aet_exit(void)
intel_pmt_put_feature_group((*peg)->pfg);
(*peg)->pfg = NULL;
}
+ free_mmio_info((*peg)->pkginfo);
}
}
--
2.49.0
* [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (18 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 19/29] x86/resctrl: Complete telemetry event enumeration Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 3:57 ` Reinette Chatre
2025-06-07 0:57 ` Fenghua Yu
2025-05-21 22:50 ` [PATCH v5 21/29] x86/resctrl: Read core telemetry events Tony Luck
` (10 subsequent siblings)
30 siblings, 2 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Clearwater Forest supports two energy-related telemetry events
and seven perf-style events. The counters are arranged in per-RMID
blocks like this:
MMIO offset:0x00 Counter for RMID 0 Event 0
MMIO offset:0x08 Counter for RMID 0 Event 1
MMIO offset:0x10 Counter for RMID 0 Event 2
MMIO offset:0x18 Counter for RMID 1 Event 0
MMIO offset:0x20 Counter for RMID 1 Event 1
MMIO offset:0x28 Counter for RMID 1 Event 2
...
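For illustration, with this layout the byte offset of event 'idx' for a
given RMID is (rmid * num_events + idx) * 8; e.g. with three events per
RMID, event 2 of RMID 1 is at (1 * 3 + 2) * 8 = 0x28, matching the
table above.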
Define these events in the file system code and add the events
to the event_group structures.
PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in binary fixed-point
format. File system code must display them as decimal values with a
fractional part.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl_types.h | 11 ++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 33 ++++++++++++++++++
fs/resctrl/monitor.c | 45 +++++++++++++++++++++++++
3 files changed, 89 insertions(+)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index b468bfbab9ea..455b29a0a9b9 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -43,6 +43,17 @@ enum resctrl_event_id {
QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
+ /* Intel Telemetry Events */
+ PMT_EVENT_ENERGY,
+ PMT_EVENT_ACTIVITY,
+ PMT_EVENT_STALLS_LLC_HIT,
+ PMT_EVENT_C1_RES,
+ PMT_EVENT_UNHALTED_CORE_CYCLES,
+ PMT_EVENT_STALLS_LLC_MISS,
+ PMT_EVENT_AUTO_C6_RES,
+ PMT_EVENT_UNHALTED_REF_CYCLES,
+ PMT_EVENT_UOPS_RETIRED,
+
/* Must be the last */
QOS_NUM_EVENTS,
};
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 2316198eb69e..bf8e2a6126d2 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -34,6 +34,20 @@ struct mmio_info {
void __iomem *addrs[] __counted_by(count);
};
+/**
+ * struct pmt_event - Telemetry event.
+ * @evtid: Resctrl event id
+ * @evt_idx: Counter index within each per-RMID block of counters
+ * @bin_bits: Zero for integer valued events, else number bits in fixed-point
+ */
+struct pmt_event {
+ enum resctrl_event_id evtid;
+ int evt_idx;
+ int bin_bits;
+};
+
+#define EVT(id, idx, bits) { .evtid = id, .evt_idx = idx, .bin_bits = bits }
+
/**
* struct event_group - All information about a group of telemetry events.
* @pfg: Points to the aggregated telemetry space information
@@ -42,6 +56,8 @@ struct mmio_info {
* @pkginfo: Per-package MMIO addresses of telemetry regions belonging to this group
* @guid: Unique number per XML description file.
* @mmio_size: Number of bytes of MMIO registers for this group.
+ * @num_events: Number of events in this group.
+ * @evts: Array of event descriptors.
*/
struct event_group {
/* Data fields used by this code. */
@@ -51,6 +67,8 @@ struct event_group {
/* Remaining fields initialized from XML file. */
u32 guid;
size_t mmio_size;
+ int num_events;
+ struct pmt_event evts[] __counted_by(num_events);
};
/*
@@ -60,6 +78,11 @@ struct event_group {
static struct event_group energy_0x26696143 = {
.guid = 0x26696143,
.mmio_size = (576 * 2 + 3) * 8,
+ .num_events = 2,
+ .evts = {
+ EVT(PMT_EVENT_ENERGY, 0, 18),
+ EVT(PMT_EVENT_ACTIVITY, 1, 18),
+ }
};
/*
@@ -69,6 +92,16 @@ static struct event_group energy_0x26696143 = {
static struct event_group perf_0x26557651 = {
.guid = 0x26557651,
.mmio_size = (576 * 7 + 3) * 8,
+ .num_events = 7,
+ .evts = {
+ EVT(PMT_EVENT_STALLS_LLC_HIT, 0, 0),
+ EVT(PMT_EVENT_C1_RES, 1, 0),
+ EVT(PMT_EVENT_UNHALTED_CORE_CYCLES, 2, 0),
+ EVT(PMT_EVENT_STALLS_LLC_MISS, 3, 0),
+ EVT(PMT_EVENT_AUTO_C6_RES, 4, 0),
+ EVT(PMT_EVENT_UNHALTED_REF_CYCLES, 5, 0),
+ EVT(PMT_EVENT_UOPS_RETIRED, 6, 0),
+ }
};
static struct event_group *known_event_groups[] = {
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index f554d7933739..f24a568f7b67 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -876,6 +876,51 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
.evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
.rid = RDT_RESOURCE_L3,
},
+ [PMT_EVENT_ENERGY] = {
+ .name = "core_energy",
+ .evtid = PMT_EVENT_ENERGY,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_ACTIVITY] = {
+ .name = "activity",
+ .evtid = PMT_EVENT_ACTIVITY,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_STALLS_LLC_HIT] = {
+ .name = "stalls_llc_hit",
+ .evtid = PMT_EVENT_STALLS_LLC_HIT,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_C1_RES] = {
+ .name = "c1_res",
+ .evtid = PMT_EVENT_C1_RES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_UNHALTED_CORE_CYCLES] = {
+ .name = "unhalted_core_cycles",
+ .evtid = PMT_EVENT_UNHALTED_CORE_CYCLES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_STALLS_LLC_MISS] = {
+ .name = "stalls_llc_miss",
+ .evtid = PMT_EVENT_STALLS_LLC_MISS,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_AUTO_C6_RES] = {
+ .name = "c6_res",
+ .evtid = PMT_EVENT_AUTO_C6_RES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_UNHALTED_REF_CYCLES] = {
+ .name = "unhalted_ref_cycles",
+ .evtid = PMT_EVENT_UNHALTED_REF_CYCLES,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
+ [PMT_EVENT_UOPS_RETIRED] = {
+ .name = "uops_retired",
+ .evtid = PMT_EVENT_UOPS_RETIRED,
+ .rid = RDT_RESOURCE_PERF_PKG,
+ },
};
void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu, u32 binary_bits)
--
2.49.0
* [PATCH v5 21/29] x86/resctrl: Read core telemetry events
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (19 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 4:02 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
` (9 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The resctrl file system passes requests to read event monitor files to
the architecture's resctrl_arch_rmid_read() function to collect values
from hardware counters.
Use the resctrl resource to differentiate calls to read legacy
L3 events from the new telemetry events (which are attached to
RDT_RESOURCE_PERF_PKG).
There may be multiple aggregators tracking each package, so scan all of
them and add up all counters.
At run time, when a user reads an event file, the file system code
provides the enum resctrl_event_id for the event.
Create a lookup table indexed by event id to provide the event_group
structure and the event index into MMIO space.
Enable the events, marking them as readable from any CPU.
Resctrl now uses readq() so depends on X86_64. Update Kconfig.
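As a sketch of the resulting read flow for one (domain, RMID, event):
each aggregator on the package exposes a 64-bit register at byte offset
(rmid * num_events + evt_idx) * 8, where bit 63 is a "data valid" flag
and bits 62:0 hold the count. The value reported to the file system is
the sum of the low 63 bits across all aggregators on the package; the
read fails if any aggregator reports invalid data.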
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/intel_aet.c | 53 +++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 6 +++
arch/x86/Kconfig | 2 +-
4 files changed, 61 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 2b2d4b5a4643..42da0a222c7c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -169,5 +169,6 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
+int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val);
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index bf8e2a6126d2..be52c9302a80 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -13,6 +13,7 @@
#include <linux/cleanup.h>
#include <linux/cpu.h>
+#include <linux/io.h>
#include <linux/resctrl.h>
#include <linux/slab.h>
@@ -128,6 +129,16 @@ static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
return false;
}
+/**
+ * struct evtinfo - lookup table from resctrl_event_id to useful information
+ * @event_group: Pointer to the event_group structure for this event
+ * @idx: Counter index within each per-RMID block of counters
+ */
+static struct evtinfo {
+ struct event_group *event_group;
+ int idx;
+} evtinfo[QOS_NUM_EVENTS];
+
static void free_mmio_info(struct mmio_info **mmi)
{
int num_pkgs = topology_max_packages();
@@ -199,6 +210,15 @@ static int configure_events(struct event_group *e, struct pmt_feature_group *p)
}
e->pkginfo = no_free_ptr(pkginfo);
+ for (int i = 0; i < e->num_events; i++) {
+ enum resctrl_event_id evt;
+
+ evt = e->evts[i].evtid;
+ evtinfo[evt].event_group = e;
+ evtinfo[evt].idx = e->evts[i].evt_idx;
+ resctrl_enable_mon_event(evt, true, e->evts[i].bin_bits);
+ }
+
return 0;
}
@@ -266,3 +286,36 @@ void __exit intel_aet_exit(void)
free_mmio_info((*peg)->pkginfo);
}
}
+
+#define DATA_VALID BIT_ULL(63)
+#define DATA_BITS GENMASK_ULL(62, 0)
+
+/*
+ * Read counter for an event on a domain (summing all aggregators
+ * on the domain).
+ */
+int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val)
+{
+ struct evtinfo *info = &evtinfo[evtid];
+ struct mmio_info *mmi;
+ u64 evtcount;
+ int idx;
+
+ idx = rmid * info->event_group->num_events;
+ idx += info->idx;
+ mmi = info->event_group->pkginfo[domid];
+
+ if (idx * sizeof(u64) + sizeof(u64) > info->event_group->mmio_size) {
+ pr_warn_once("MMIO index %d out of range\n", idx);
+ return -EIO;
+ }
+
+ for (int i = 0; i < mmi->count; i++) {
+ evtcount = readq(mmi->addrs[i] + idx * sizeof(u64));
+ if (!(evtcount & DATA_VALID))
+ return -EINVAL;
+ *val += evtcount & DATA_BITS;
+ }
+
+ return 0;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 1f6dc253112f..c99aa9dacfd8 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -230,6 +230,12 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
resctrl_arch_rmid_read_context_check();
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ return intel_aet_read_event(d->hdr.id, rmid, eventid, val);
+
+ if (r->rid != RDT_RESOURCE_L3)
+ return -EIO;
+
prmid = logical_rmid_to_physical_rmid(cpu, rmid);
ret = __rmid_read_phys(prmid, eventid, &msr_val);
if (ret)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 52cfb69c343f..24df3f04a115 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -506,7 +506,7 @@ config X86_MPPARSE
config X86_CPU_RESCTRL
bool "x86 CPU resource control support"
- depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
+ depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
depends on MISC_FILESYSTEMS
select ARCH_HAS_CPU_RESCTRL
select RESCTRL_FS
--
2.49.0
* [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (20 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 21/29] x86/resctrl: Read core telemetry events Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 4:06 ` Reinette Chatre
2025-06-07 0:54 ` Fenghua Yu
2025-05-21 22:50 ` [PATCH v5 23/29] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
` (8 subsequent siblings)
30 siblings, 2 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The L3 resource attaches extra state to each of its monitor domains:
structures that hold the 64-bit counter values and fields to track the
overflow and limbo worker threads.
None of these are needed for the PERF_PKG resource. The hardware counters
are wide enough that they do not wrap around for decades.
Define a new rdt_perf_pkg_mon_domain structure which just consists of
the standard rdt_domain_hdr to keep track of domain id and CPU mask.
Change domain_add_cpu_mon(), domain_remove_cpu_mon(),
resctrl_offline_mon_domain(), and resctrl_online_mon_domain() to check
resource type and perform only the operations needed for domains in the
PERF_PKG resource.
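For contrast, a rough field list of the L3 monitor domain (taken from
earlier patches in this series, trimmed):

  struct rdt_domain_hdr hdr;         /* the only part PERF_PKG keeps */
  struct cacheinfo *ci;
  unsigned long *rmid_busy_llc;
  struct mbm_state *mbm_states[QOS_NUM_L3_MBM_EVENTS];
  struct delayed_work mbm_over;
  struct delayed_work cqm_limbo;
  ...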
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 41 ++++++++++++++++++++++++++++++
fs/resctrl/rdtgroup.c | 4 +++
2 files changed, 45 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 64ce561e77a0..18d84c497ee4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -540,6 +540,38 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
}
}
+/**
+ * struct rdt_perf_pkg_mon_domain - CPUs sharing an Intel-PMT-scoped resctrl monitor resource
+ * @hdr: common header for different domain types
+ */
+struct rdt_perf_pkg_mon_domain {
+ struct rdt_domain_hdr hdr;
+};
+
+static void setup_intel_aet_mon_domain(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos)
+{
+ struct rdt_perf_pkg_mon_domain *d;
+ int err;
+
+ d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
+ if (!d)
+ return;
+
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = r->rid;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_mon_domain(r, &d->hdr);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ kfree(d);
+ }
+}
+
static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
@@ -567,6 +599,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
case RDT_RESOURCE_L3:
l3_mon_domain_setup(cpu, id, r, add_pos);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ setup_intel_aet_mon_domain(cpu, id, r, add_pos);
+ break;
default:
WARN_ON_ONCE(1);
}
@@ -666,6 +701,12 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
default:
pr_warn_once("Unknown resource rid=%d\n", r->rid);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ resctrl_offline_mon_domain(r, hdr);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
+ kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr));
+ break;
}
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 1e1cc8001cbc..6078cdd5cad0 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4170,6 +4170,8 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
goto out_unlock;
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ goto do_mkdir;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
err = domain_setup_l3_mon_state(r, d);
if (err)
@@ -4184,6 +4186,8 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+do_mkdir:
+ err = 0;
/*
* If the filesystem is not mounted then only the default resource group
* exists. Creation of its directories is deferred until mount time
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 23/29] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (21 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-21 22:50 ` [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
` (7 subsequent siblings)
30 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The RDT_RESOURCE_PERF_PKG resource is not marked as "mon_capable" during
early resctrl initialization. This means that the domain lists for the
resource are not built when the CPU hot plug notifiers are registered.
Mark the resource as mon_capable and call domain_add_cpu_mon() for
each online CPU to build the domain lists in the first call to the
resctrl_arch_pre_mount() hook.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 18d84c497ee4..f07f5b58639a 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -763,14 +763,27 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
void resctrl_arch_pre_mount(void)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
static atomic_t only_once;
- int old = 0;
+ int cpu, old = 0;
if (!atomic_try_cmpxchg(&only_once, &old, 1))
return;
if (!intel_aet_get_events())
return;
+
+ /*
+ * Late discovery of telemetry events means the domains for the
+ * resource were not built. Do that now.
+ */
+ cpus_read_lock();
+ mutex_lock(&domain_list_lock);
+ r->mon_capable = true;
+ for_each_online_cpu(cpu)
+ domain_add_cpu_mon(cpu, r);
+ mutex_unlock(&domain_list_lock);
+ cpus_read_unlock();
}
enum {
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (22 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 23/29] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 4:10 ` Reinette Chatre
2025-06-06 23:55 ` Fenghua Yu
2025-05-21 22:50 ` [PATCH v5 25/29] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
` (6 subsequent siblings)
30 siblings, 2 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Users may want to force either of the telemetry features on (e.g. when
a feature is disabled due to an erratum) or off (e.g. when a telemetry
feature supports so few RMIDs that it reduces the number of monitor
groups that can be created). E.g. "rdt=energy" forces the energy events
on, while "rdt=!perf" turns the perf events off.
Unlike other options that are tied to X86_FEATURE_* flags,
these must be queried by name. Add helper functions to do that.
Add checks for users who forced either feature off.
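A minimal sketch of a caller (illustrative only; the real checks added
by this and the following patch live in intel_aet.c):

  static bool energy_group_allowed(void)
  {
          /* false if the user booted with rdt=!energy */
          return !rdt_is_option_force_disabled("energy");
  }

The following patch also uses rdt_is_option_force_enabled() so that
"rdt=energy" or "rdt=perf" can turn on an event group that would
otherwise be skipped for supporting too few RMIDs.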
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
.../admin-guide/kernel-parameters.txt | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 4 +++
arch/x86/kernel/cpu/resctrl/core.c | 28 +++++++++++++++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 6 ++++
4 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d9fd26b95b34..4811bc812f0f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5988,7 +5988,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec.
+ mba, smba, bmec, energy, perf.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 42da0a222c7c..524f3c183900 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -167,6 +167,10 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
+bool rdt_is_option_force_enabled(char *option);
+
+bool rdt_is_option_force_disabled(char *option);
+
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f07f5b58639a..b23309566500 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -797,6 +797,8 @@ enum {
RDT_FLAG_MBA,
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
+ RDT_FLAG_ENERGY,
+ RDT_FLAG_PERF,
};
#define RDT_OPT(idx, n, f) \
@@ -822,6 +824,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
+ RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
+ RDT_OPT(RDT_FLAG_PERF, "perf", 0),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
@@ -871,6 +875,30 @@ bool rdt_cpu_has(int flag)
return ret;
}
+bool rdt_is_option_force_enabled(char *name)
+{
+ struct rdt_options *o;
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (!strcmp(name, o->name))
+ return o->force_on;
+ }
+
+ return false;
+}
+
+bool rdt_is_option_force_disabled(char *name)
+{
+ struct rdt_options *o;
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (!strcmp(name, o->name))
+ return o->force_off;
+ }
+
+ return false;
+}
+
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
{
if (!rdt_cpu_has(X86_FEATURE_BMEC))
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index be52c9302a80..c1fc85dbf0d8 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -51,6 +51,7 @@ struct pmt_event {
/**
* struct event_group - All information about a group of telemetry events.
+ * @name: Name for this group (used by boot rdt= option)
* @pfg: Points to the aggregated telemetry space information
* within the OOBMSM driver that contains data for all
* telemetry regions.
@@ -62,6 +63,7 @@ struct pmt_event {
*/
struct event_group {
/* Data fields used by this code. */
+ char *name;
struct pmt_feature_group *pfg;
struct mmio_info **pkginfo;
@@ -77,6 +79,7 @@ struct event_group {
* File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
*/
static struct event_group energy_0x26696143 = {
+ .name = "energy",
.guid = 0x26696143,
.mmio_size = (576 * 2 + 3) * 8,
.num_events = 2,
@@ -91,6 +94,7 @@ static struct event_group energy_0x26696143 = {
* File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
*/
static struct event_group perf_0x26557651 = {
+ .name = "perf",
.guid = 0x26557651,
.mmio_size = (576 * 7 + 3) * 8,
.num_events = 7,
@@ -247,6 +251,8 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
for (int i = 0; i < p->count; i++) {
if ((*peg)->guid == p->regions[i].guid) {
+ if (rdt_is_option_force_disabled((*peg)->name))
+ return false;
ret = configure_events(*peg, p);
if (!ret) {
(*peg)->pfg = no_free_ptr(p);
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 25/29] x86/resctrl: Handle number of RMIDs supported by telemetry resources
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (23 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 4:13 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 26/29] x86,fs/resctrl: Move RMID initialization to first mount Tony Luck
` (5 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There are now three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF. This
is the maximum number of distinct values that can be loaded into the
IA32_PQR_ASSOC MSR. Note that on systems with Sub-NUMA Cluster mode
enabled the CPUID enumerated value is scaled down by the number of SNC
nodes per L3 cache.
2) The number of registers in MMIO space for each event. This
is enumerated in the XML files and is the value initialized into
event_group::num_rmids. This will be overwritten with a lower
value if hardware does not support all these registers at the
same time (see next case).
3) The number of "h/w counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that
does describe the limitations) feeding to those MMIO registers. This
is enumerated in telemetry_region::num_rmids returned from the call to
intel_pmt_get_regions_by_feature()
Event groups with insufficient "h/w counters" to track all RMIDs are
difficult to use, since the system may reassign "h/w counters" at any
time. This means that users cannot reliably collect two consecutive
event counts to compute the rate at which events are occurring.
Ignore such under-resourced event groups unless the user explicitly
requests to enable them using the "rdt=" Linux boot argument.
Scan all enabled event groups and set the RDT_RESOURCE_PERF_PKG
resource's "num_rmid" to the smallest of these values to ensure that
all resctrl groups have equal monitor capabilities.
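As a worked example (hypothetical numbers): suppose CPUID enumerates
576 RMIDs for L3 (rdt_num_system_rmids = 576), the "energy" XML file
describes 576 register sets, but intel_pmt_get_regions_by_feature()
reports only 512 "h/w counters". The energy group is then skipped
unless the user boots with "rdt=energy". If forced on, its
event_group::num_rmids is clamped to 512 and, being the smallest of
the enabled groups, RDT_RESOURCE_PERF_PKG's "num_rmid" becomes 512.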
N.B. Changed the type of rdt_resource::num_rmid to u32 to match.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 27 +++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 2 ++
4 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 4ba51cb598e1..b7e15abcde23 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -286,7 +286,7 @@ struct rdt_resource {
int rid;
bool alloc_capable;
bool mon_capable;
- int num_rmid;
+ u32 num_rmid;
enum resctrl_scope ctrl_scope;
enum resctrl_scope mon_scope;
struct resctrl_cache cache;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 524f3c183900..795534b9b9d2 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -18,6 +18,8 @@
#define RMID_VAL_UNAVAIL BIT_ULL(62)
+extern int rdt_num_system_rmids;
+
/*
* With the above fields in use 62 bits remain in MSR_IA32_QM_CTR for
* data to be returned. The counter width is discovered from the hardware
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index c1fc85dbf0d8..1b41167ad976 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -14,6 +14,7 @@
#include <linux/cleanup.h>
#include <linux/cpu.h>
#include <linux/io.h>
+#include <linux/minmax.h>
#include <linux/resctrl.h>
#include <linux/slab.h>
@@ -57,6 +58,9 @@ struct pmt_event {
* telemetry regions.
* @pkginfo: Per-package MMIO addresses of telemetry regions belonging to this group
* @guid: Unique number per XML description file.
+ * @num_rmids: Number of RMIDS supported by this group. Will be adjusted downwards
+ * if enumeration from intel_pmt_get_regions_by_feature() indicates
+ * fewer RMIDs can be tracked simultaneously.
* @mmio_size: Number of bytes of MMIO registers for this group.
* @num_events: Number of events in this group.
* @evts: Array of event descriptors.
@@ -69,6 +73,7 @@ struct event_group {
/* Remaining fields initialized from XML file. */
u32 guid;
+ u32 num_rmids;
size_t mmio_size;
int num_events;
struct pmt_event evts[] __counted_by(num_events);
@@ -81,6 +86,7 @@ struct event_group {
static struct event_group energy_0x26696143 = {
.name = "energy",
.guid = 0x26696143,
+ .num_rmids = 576,
.mmio_size = (576 * 2 + 3) * 8,
.num_events = 2,
.evts = {
@@ -96,6 +102,7 @@ static struct event_group energy_0x26696143 = {
static struct event_group perf_0x26557651 = {
.name = "perf",
.guid = 0x26557651,
+ .num_rmids = 576,
.mmio_size = (576 * 7 + 3) * 8,
.num_events = 7,
.evts = {
@@ -253,6 +260,15 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
if ((*peg)->guid == p->regions[i].guid) {
if (rdt_is_option_force_disabled((*peg)->name))
return false;
+ /*
+ * Ignore event group with insufficient RMIDs unless the
+ * user used the rdt= boot option to specifically ask
+ * for it to be enabled.
+ */
+ if (p->regions[i].num_rmids < rdt_num_system_rmids &&
+ !rdt_is_option_force_enabled((*peg)->name))
+ return false;
+ (*peg)->num_rmids = min((*peg)->num_rmids, p->regions[i].num_rmids);
ret = configure_events(*peg, p);
if (!ret) {
(*peg)->pfg = no_free_ptr(p);
@@ -272,11 +288,22 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
*/
bool intel_aet_get_events(void)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
+ struct event_group **eg;
bool ret1, ret2;
ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM);
ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM);
+ for (eg = &known_event_groups[0]; eg < &known_event_groups[NUM_KNOWN_GROUPS]; eg++) {
+ if (!(*eg)->pfg)
+ continue;
+ if (r->num_rmid)
+ r->num_rmid = min(r->num_rmid, (*eg)->num_rmids);
+ else
+ r->num_rmid = (*eg)->num_rmids;
+ }
+
return ret1 || ret2;
}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c99aa9dacfd8..9cd37be262a2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -32,6 +32,7 @@ bool rdt_mon_capable;
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
+int rdt_num_system_rmids;
static int snc_nodes_per_l3_cache = 1;
/*
@@ -350,6 +351,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
+ rdt_num_system_rmids = r->num_rmid;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 26/29] x86,fs/resctrl: Move RMID initialization to first mount
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (24 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 25/29] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-21 22:50 ` [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file Tony Luck
` (4 subsequent siblings)
30 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The resctrl file system code assumed that the only monitor events were
tied to the RDT_RESOURCE_L3 resource, and that the number of supported
RMIDs was enumerated during early initialization.
RDT_RESOURCE_PERF_PKG breaks both of those assumptions.
Delay the final enumeration of the number of RMIDs and subsequent
allocation of structures until first mount of the resctrl file system.
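The first mount is late enough because of the ordering in
rdt_get_tree(). A simplified sketch of the calls this change relies on:

  rdt_get_tree()
      resctrl_arch_pre_mount()      <- enumerates telemetry events,
                                       sets PERF_PKG num_rmid
      resctrl_mon_dom_data_init()   <- first mount only: allocates
                                       rmid_ptrs[] sized by
                                       resctrl_arch_system_num_rmid_idx()
      rdtgroup_setup_root()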
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 4 ++-
arch/x86/kernel/cpu/resctrl/core.c | 8 +++--
fs/resctrl/monitor.c | 58 +++++++++++++-----------------
fs/resctrl/rdtgroup.c | 12 +++++--
4 files changed, 42 insertions(+), 40 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 64c1c226d676..1f4800cfcd6a 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -348,6 +348,8 @@ int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
+int resctrl_mon_dom_data_init(void);
+
void resctrl_mon_resource_exit(void);
void mon_event_count(void *info);
@@ -358,7 +360,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, struct mon_evt *evt, int first);
-int resctrl_mon_resource_init(void);
+void resctrl_mon_l3_resource_init(void);
void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b23309566500..8a9ceb03e252 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -111,10 +111,14 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
u32 resctrl_arch_system_num_rmid_idx(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ u32 num_rmids = U32_MAX;
+ struct rdt_resource *r;
+
+ for_each_mon_capable_rdt_resource(r)
+ num_rmids = min(num_rmids, r->num_rmid);
/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
- return r->num_rmid;
+ return num_rmids;
}
struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index f24a568f7b67..6041cb304624 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -775,15 +775,27 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
-static int dom_data_init(struct rdt_resource *r)
+/*
+ * resctrl_mon_dom_data_init() - Initialise global monitoring structures.
+ *
+ * Allocate and initialise global monitor resources that do not belong to a
+ * specific domain. i.e. the rmid_ptrs[] used for the limbo and free lists.
+ * Called once when the resctrl filesystem is first mounted, after the
+ * struct rdt_resource's have been configured.
+ * Resctrl's cpuhp callbacks may be called before this point to bring a domain
+ * online.
+ *
+ * Returns 0 for success, or -ENOMEM.
+ */
+int resctrl_mon_dom_data_init(void)
{
+ struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
u32 num_closid = resctrl_arch_get_num_closid(r);
struct rmid_entry *entry = NULL;
- int err = 0, i;
u32 idx;
+ int i;
- mutex_lock(&rdtgroup_mutex);
if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
u32 *tmp;
@@ -794,10 +806,8 @@ static int dom_data_init(struct rdt_resource *r)
* use.
*/
tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL);
- if (!tmp) {
- err = -ENOMEM;
- goto out_unlock;
- }
+ if (!tmp)
+ return -ENOMEM;
closid_num_dirty_rmid = tmp;
}
@@ -808,8 +818,7 @@ static int dom_data_init(struct rdt_resource *r)
kfree(closid_num_dirty_rmid);
closid_num_dirty_rmid = NULL;
}
- err = -ENOMEM;
- goto out_unlock;
+ return -ENOMEM;
}
for (i = 0; i < idx_limit; i++) {
@@ -830,13 +839,10 @@ static int dom_data_init(struct rdt_resource *r)
entry = __rmid_entry(idx);
list_del(&entry->list);
-out_unlock:
- mutex_unlock(&rdtgroup_mutex);
-
- return err;
+ return 0;
}
-static void dom_data_exit(struct rdt_resource *r)
+static void resctrl_mon_dom_data_exit(struct rdt_resource *r)
{
mutex_lock(&rdtgroup_mutex);
@@ -943,28 +949,14 @@ bool resctrl_is_mon_event_enabled(enum resctrl_event_id evtid)
}
/**
- * resctrl_mon_resource_init() - Initialise global monitoring structures.
- *
- * Allocate and initialise global monitor resources that do not belong to a
- * specific domain. i.e. the rmid_ptrs[] used for the limbo and free lists.
- * Called once during boot after the struct rdt_resource's have been configured
- * but before the filesystem is mounted.
- * Resctrl's cpuhp callbacks may be called before this point to bring a domain
- * online.
- *
- * Returns 0 for success, or -ENOMEM.
+ * resctrl_mon_l3_resource_init() - Initialise L3 configuration options.
*/
-int resctrl_mon_resource_init(void)
+void resctrl_mon_l3_resource_init(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- int ret;
if (!r->mon_capable)
- return 0;
-
- ret = dom_data_init(r);
- if (ret)
- return ret;
+ return;
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
@@ -981,13 +973,11 @@ int resctrl_mon_resource_init(void)
mba_mbps_default_event = QOS_L3_MBM_LOCAL_EVENT_ID;
else if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
-
- return 0;
}
void resctrl_mon_resource_exit(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- dom_data_exit(r);
+ resctrl_mon_dom_data_exit(r);
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 6078cdd5cad0..e212e46e0780 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2583,6 +2583,7 @@ static int rdt_get_tree(struct fs_context *fc)
unsigned long flags = RFTYPE_CTRL_BASE;
struct rdt_l3_mon_domain *dom;
struct rdt_resource *r;
+ static bool once;
int ret;
resctrl_arch_pre_mount();
@@ -2597,6 +2598,13 @@ static int rdt_get_tree(struct fs_context *fc)
goto out;
}
+ if (resctrl_arch_mon_capable() && !once) {
+ ret = resctrl_mon_dom_data_init();
+ if (ret)
+ goto out;
+ once = true;
+ }
+
ret = rdtgroup_setup_root(ctx);
if (ret)
goto out;
@@ -4290,9 +4298,7 @@ int resctrl_init(void)
thread_throttle_mode_init();
- ret = resctrl_mon_resource_init();
- if (ret)
- return ret;
+ resctrl_mon_l3_resource_init();
ret = sysfs_create_mount_point(fs_kobj, "resctrl");
if (ret) {
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (25 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 26/29] x86,fs/resctrl: Move RMID initialization to first mount Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-06-04 4:15 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 28/29] x86/resctrl: Add info/PERF_PKG_MON/status file Tony Luck
` (3 subsequent siblings)
30 siblings, 1 reply; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Creation of all files in the resctrl file system is under control of
the file system layer.
But some resources may need to add a file to the info/{resource}
directory for debug purposes.
Add a new rdt_resource::info_file field for the resource to specify
show() and/or write() operations. These will be called with the
rdtgroup_mutex held.
Architecture code can mark the file as being only for debug use by
setting the RFTYPE_DEBUG bit in rftype::flags.
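A minimal sketch of what a resource owner provides (hypothetical
"myres_" names; the next patch adds the real user for PERF_PKG):

  static int myres_status_show(struct kernfs_open_file *of,
                               struct seq_file *s, void *v)
  {
          /* called with rdtgroup_mutex held */
          seq_puts(s, "ok\n");
          return 0;
  }

  ...
  r->info_file.name = "status";
  r->info_file.seq_show = myres_status_show;
  r->info_file.flags = RFTYPE_DEBUG;    /* only shown with "-o debug" */

The file system layer derives the file mode from which of seq_show()
and write() are provided and attaches its own kernfs_ops.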
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 33 +++++++++++++++++++++++++++
fs/resctrl/internal.h | 31 ++-----------------------
fs/resctrl/rdtgroup.c | 50 ++++++++++++++++++++++++++++++++++++++---
3 files changed, 82 insertions(+), 32 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index b7e15abcde23..e067007c633c 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -73,6 +73,37 @@ enum resctrl_conf_type {
#define CDP_NUM_TYPES (CDP_DATA + 1)
+/**
+ * struct rftype - describe each file in the resctrl file system
+ * @name: File name
+ * @mode: Access mode
+ * @kf_ops: File operations
+ * @flags: File specific RFTYPE_FLAGS_* flags
+ * @fflags: File specific RFTYPE_* flags
+ * @seq_show: Show content of the file
+ * @write: Write to the file
+ */
+struct rftype {
+ char *name;
+ umode_t mode;
+ const struct kernfs_ops *kf_ops;
+ unsigned long flags;
+ unsigned long fflags;
+
+ int (*seq_show)(struct kernfs_open_file *of,
+ struct seq_file *sf, void *v);
+ /*
+ * write() is the generic write callback which maps directly to
+ * kernfs write operation and overrides all other operations.
+ * Maximum write size is determined by ->max_write_len.
+ */
+ ssize_t (*write)(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off);
+};
+
+/* Only rftype::flags option available to architecture code */
+#define RFTYPE_DEBUG BIT(10)
+
/*
* struct pseudo_lock_region - pseudo-lock region information
* @s: Resctrl schema for the resource to which this
@@ -281,6 +312,7 @@ enum resctrl_schema_fmt {
* @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth
* monitoring events can be configured.
* @cdp_capable: Is the CDP feature available on this resource
+ * @info_file: Optional per-resource debug info file
*/
struct rdt_resource {
int rid;
@@ -297,6 +329,7 @@ struct rdt_resource {
enum resctrl_schema_fmt schema_fmt;
unsigned int mbm_cfg_mask;
bool cdp_capable;
+ struct rftype info_file;
};
/*
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 1f4800cfcd6a..f13b63804c1a 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -232,7 +232,8 @@ struct rdtgroup {
#define RFTYPE_RES_MB BIT(9)
-#define RFTYPE_DEBUG BIT(10)
+// RFTYPE_DEBUG available to architecture code in <linux/resctrl.h>
+//#define RFTYPE_DEBUG BIT(10)
#define RFTYPE_RES_PERF_PKG BIT(11)
@@ -251,34 +252,6 @@ extern struct list_head rdt_all_groups;
extern int max_name_width;
-/**
- * struct rftype - describe each file in the resctrl file system
- * @name: File name
- * @mode: Access mode
- * @kf_ops: File operations
- * @flags: File specific RFTYPE_FLAGS_* flags
- * @fflags: File specific RFTYPE_* flags
- * @seq_show: Show content of the file
- * @write: Write to the file
- */
-struct rftype {
- char *name;
- umode_t mode;
- const struct kernfs_ops *kf_ops;
- unsigned long flags;
- unsigned long fflags;
-
- int (*seq_show)(struct kernfs_open_file *of,
- struct seq_file *sf, void *v);
- /*
- * write() is the generic write callback which maps directly to
- * kernfs write operation and overrides all other operations.
- * Maximum write size is determined by ->max_write_len.
- */
- ssize_t (*write)(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off);
-};
-
/**
* struct mbm_state - status for each MBM counter in each domain
* @prev_bw_bytes: Previous bytes value read for bandwidth calculation
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index e212e46e0780..f09674c209f8 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -329,6 +329,37 @@ static const struct kernfs_ops rdtgroup_kf_single_ops = {
.seq_show = rdtgroup_seqfile_show,
};
+static int rdtgroup_seqfile_show_locked(struct seq_file *m, void *arg)
+{
+ struct kernfs_open_file *of = m->private;
+ struct rftype *rft = of->kn->priv;
+
+ guard(mutex)(&rdtgroup_mutex);
+
+ if (rft->seq_show)
+ return rft->seq_show(of, m, arg);
+ return 0;
+}
+
+static ssize_t rdtgroup_file_write_locked(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct rftype *rft = of->kn->priv;
+
+ guard(mutex)(&rdtgroup_mutex);
+
+ if (rft->write)
+ return rft->write(of, buf, nbytes, off);
+
+ return -EINVAL;
+}
+
+static const struct kernfs_ops rdtgroup_kf_single_locked_ops = {
+ .atomic_write_len = PAGE_SIZE,
+ .write = rdtgroup_file_write_locked,
+ .seq_show = rdtgroup_seqfile_show_locked,
+};
+
static const struct kernfs_ops kf_mondata_ops = {
.atomic_write_len = PAGE_SIZE,
.seq_show = rdtgroup_mondata_show,
@@ -2162,7 +2193,7 @@ int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
return ret;
}
-static int rdtgroup_mkdir_info_resdir(void *priv, char *name,
+static int rdtgroup_mkdir_info_resdir(struct rdt_resource *r, void *priv, char *name,
unsigned long fflags)
{
struct kernfs_node *kn_subdir;
@@ -2177,6 +2208,19 @@ static int rdtgroup_mkdir_info_resdir(void *priv, char *name,
if (ret)
return ret;
+ if (r->info_file.name &&
+ (!(r->info_file.flags & RFTYPE_DEBUG) || resctrl_debug)) {
+ r->info_file.mode = 0;
+ if (r->info_file.seq_show)
+ r->info_file.mode |= 0444;
+ if (r->info_file.write)
+ r->info_file.mode |= 0200;
+ r->info_file.kf_ops = &rdtgroup_kf_single_locked_ops;
+ ret = rdtgroup_add_file(kn_subdir, &r->info_file);
+ if (ret)
+ return ret;
+ }
+
ret = rdtgroup_add_files(kn_subdir, fflags);
if (!ret)
kernfs_activate(kn_subdir);
@@ -2221,7 +2265,7 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
fflags = fflags_from_resource(r) | RFTYPE_CTRL_INFO;
- ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags);
+ ret = rdtgroup_mkdir_info_resdir(r, s, s->name, fflags);
if (ret)
goto out_destroy;
}
@@ -2229,7 +2273,7 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
for_each_mon_capable_rdt_resource(r) {
fflags = fflags_from_resource(r) | RFTYPE_MON_INFO;
sprintf(name, "%s_MON", r->name);
- ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
+ ret = rdtgroup_mkdir_info_resdir(r, r, name, fflags);
if (ret)
goto out_destroy;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 28/29] x86/resctrl: Add info/PERF_PKG_MON/status file
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (26 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-21 22:50 ` [PATCH v5 29/29] x86/resctrl: Update Documentation for package events Tony Luck
` (2 subsequent siblings)
30 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each telemetry aggregator provides three status registers at the top
end of MMIO space after all the per-RMID per-event counters:
agg_data_loss_count: This counts the number of times that this aggregator
failed to accumulate a counter value supplied by a CPU core.
agg_data_loss_timestamp: This is a "timestamp" from a free-running
25MHz uncore timer indicating when the most recent data loss occurred.
last_update_timestamp: Another 25MHz timestamp indicating when the
most recent counter update was successfully applied.
When the resctrl file system is mounted with the "-o debug" option,
display the values of each of these status registers for each
aggregator in each enabled event group.
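A sketch of the layout assumed here (offsets from the end of an
aggregator's MMIO region; the patch open-codes these reads in
show_debug()):

  /* illustrative helper, not part of the patch */
  static u64 aet_status_reg(struct event_group *e, int pkg, int instance, int slot)
  {
          void __iomem *end = e->pkginfo[pkg]->addrs[instance] + e->mmio_size;

          /*
           * slot 0..2 selects the last three qwords: data loss count,
           * data loss timestamp, last update timestamp.
           */
          return readq(end - (3 - slot) * sizeof(u64));
  }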
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 34 +++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 1b41167ad976..459e42459178 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -16,6 +16,7 @@
#include <linux/io.h>
#include <linux/minmax.h>
#include <linux/resctrl.h>
+#include <linux/seq_file.h>
#include <linux/slab.h>
/* Temporary - delete from final version */
@@ -282,6 +283,35 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
return false;
}
+static void show_debug(struct seq_file *s, struct event_group *e, int pkg, int instance)
+{
+ void __iomem *info = e->pkginfo[pkg]->addrs[instance] + e->mmio_size;
+
+ /* Information registers are the last three qwords in MMIO space */
+ seq_printf(s, "%s %d:%d agg_data_loss_count = %llu\n", e->name, pkg, instance,
+ readq(info - 24));
+ seq_printf(s, "%s %d:%d agg_data_loss_timestamp = %llu\n", e->name, pkg, instance,
+ readq(info - 16));
+ seq_printf(s, "%s %d:%d last_update_timestamp = %llu\n", e->name, pkg, instance,
+ readq(info - 8));
+}
+
+static int info_status(struct kernfs_open_file *of, struct seq_file *s, void *v)
+{
+ int num_pkgs = topology_max_packages();
+ struct event_group **eg;
+
+ for (eg = &known_event_groups[0]; eg < &known_event_groups[NUM_KNOWN_GROUPS]; eg++) {
+ if (!(*eg)->pfg)
+ continue;
+ for (int i = 0; i < num_pkgs; i++)
+ for (int j = 0; j < (*eg)->pkginfo[i]->count; j++)
+ show_debug(s, *eg, i, j);
+ }
+
+ return 0;
+}
+
/*
* Ask OOBMSM discovery driver for all the RMID based telemetry groups
* that it supports.
@@ -304,6 +334,10 @@ bool intel_aet_get_events(void)
r->num_rmid = (*eg)->num_rmids;
}
+ r->info_file.name = "status";
+ r->info_file.seq_show = info_status;
+ r->info_file.flags = RFTYPE_DEBUG;
+
return ret1 || ret2;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH v5 29/29] x86/resctrl: Update Documentation for package events
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (27 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 28/29] x86/resctrl: Add info/PERF_PKG_MON/status file Tony Luck
@ 2025-05-21 22:50 ` Tony Luck
2025-05-28 17:21 ` [PATCH v5 00/29] x86/resctrl telemetry monitoring Reinette Chatre
2025-06-13 16:57 ` James Morse
30 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-21 22:50 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each "mon_data" directory is now divided between L3 events and package
events.
The "info/PERF_PKG_MON" directory contains parameters for perf events.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Documentation/filesystems/resctrl.rst | 53 ++++++++++++++++++++++-----
1 file changed, 43 insertions(+), 10 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..a452fd54b3ae 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -167,7 +167,7 @@ with respect to allocation:
bandwidth percentages are directly applied to
the threads running on the core
-If RDT monitoring is available there will be an "L3_MON" directory
+If L3 monitoring is available there will be an "L3_MON" directory
with the following files:
"num_rmids":
@@ -261,6 +261,23 @@ with the following files:
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
+If telemetry monitoring is available there will be a "PERF_PKG_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of telemetry RMIDs supported. If this is different
+ from the number reported in the L3_MON directory the limit
+ on the number of "CTRL_MON" + "MON" directories is the
+ minimum of the values.
+
+"mon_features":
+ Lists the telemetry monitoring events that are enabled on this system.
+
+When the filesystem is mounted with the debug option a resource's subdirectory
+of the "info" directory may contain an additional "status" file. Resources
+use this to supply debug information about the status of the hardware
+implementing the resource.
+
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
@@ -366,15 +383,31 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:
"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
- "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
- files provide a read out of the current value of the event for
- all tasks in the group. In CTRL_MON groups these files provide
- the sum for all tasks in the CTRL_MON group and all tasks in
- MON groups. Please see example section for more details on usage.
+ This contains a set of directories, one for each instance
+ of an L3 cache, or of a processor package. The L3 cache
+ directories are named "mon_L3_00", "mon_L3_01" etc. The
+ package directories "mon_PERF_PKG_00", "mon_PERF_PKG_01" etc.
+
+ Within each directory there is one file per event. In
+ the L3 directories: "llc_occupancy", "mbm_total_bytes",
+ and "mbm_local_bytes". In the PERF_PKG directories: "core_energy",
+ "activity", etc.
+
+ "core_energy" reports a floating point number for the energy
+ (in Joules) used by cores for each RMID.
+
+ "activity" also reports a floating point value (in Farads).
+ This provides an estimate of work done independent of the
+ frequency that the cores used for execution.
+
+ All other events report decimal integer values.
+
+ In a MON group these files provide a read out of the current
+ value of the event for all tasks in the group. In CTRL_MON groups
+ these files provide the sum for all tasks in the CTRL_MON group
+ and all tasks in MON groups. Please see example section for more
+ details on usage.
+
On systems with Sub-NUMA Cluster (SNC) enabled there are extra
directories for each node (located within the "mon_L3_XX" directory
for the L3 cache they occupy). These are named "mon_sub_L3_YY"
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* Re: [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr
2025-05-21 22:50 ` [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr Tony Luck
@ 2025-05-22 0:01 ` Keshavamurthy, Anil S
2025-05-22 0:15 ` Luck, Tony
2025-06-04 3:37 ` Reinette Chatre
2025-06-07 0:52 ` Fenghua Yu
2 siblings, 1 reply; 90+ messages in thread
From: Keshavamurthy, Anil S @ 2025-05-22 0:01 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen Yu
Cc: x86, linux-kernel, patches, anil.s.keshavamurthy
Hi Tony,
On 5/21/2025 3:50 PM, Tony Luck wrote:
> Historically all monitoring events have been associated with the L3
> resource and it made sense to use "struct rdt_mon_domain *" arguments
> to functions manipulating domains. But the addition of monitor events
> tied to other resources changes this assumption.
>
> Some functionality like:
> *) adding a CPU to an existing domain
> *) removing a CPU that is not the last one from a domain
> can be achieved with just access to the rdt_domain_hdr structure.
>
> Change arguments from "rdt_*_domain" to rdt_domain_hdr so functions
> can be used on domains from any resource.
>
> Add sanity checks where container_of() is used to find the surrounding
> domain structure that hdr has the expected type.
>
> Simplify code that uses "d->hdr." to "hdr->" where possible.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 4 +-
> arch/x86/kernel/cpu/resctrl/core.c | 39 +++++++-------
> fs/resctrl/rdtgroup.c | 83 +++++++++++++++++++++---------
> 3 files changed, 79 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index d6b09952ef92..c02a4d59f3eb 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -444,9 +444,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type type);
> int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_online_cpu(unsigned int cpu);
> void resctrl_offline_cpu(unsigned int cpu);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index e4125161ffbd..71b884f25475 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -458,9 +458,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (hdr) {
> if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> return;
> - d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> -
> - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + cpumask_set_cpu(cpu, &hdr->cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> return;
> @@ -524,7 +522,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>
> list_add_tail_rcu(&d->hdr.list, add_pos);
>
> - err = resctrl_online_mon_domain(r, d);
> + err = resctrl_online_mon_domain(r, &d->hdr);
> if (err) {
> list_del_rcu(&d->hdr.list);
> synchronize_rcu();
> @@ -597,25 +595,24 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> return;
>
> + cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
Looks like variable 'd' is uninitialized when used here. Can you please
check?
> + if (!cpumask_empty(&hdr->cpu_mask))
> + return;
> +
> d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> hw_dom = resctrl_to_arch_ctrl_dom(d);
>
> - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> - if (cpumask_empty(&d->hdr.cpu_mask)) {
> - resctrl_offline_ctrl_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> - synchronize_rcu();
> -
> - /*
> - * rdt_ctrl_domain "d" is going to be freed below, so clear
> - * its pointer from pseudo_lock_region struct.
> - */
> - if (d->plr)
> - d->plr->d = NULL;
> - ctrl_domain_free(hw_dom);
> + resctrl_offline_ctrl_domain(r, d);
> + list_del_rcu(&hdr->list);
> + synchronize_rcu();
>
> - return;
> - }
> + /*
> + * rdt_ctrl_domain "d" is going to be freed below, so clear
> + * its pointer from pseudo_lock_region struct.
> + */
> + if (d->plr)
> + d->plr->d = NULL;
> + ctrl_domain_free(hw_dom);
> }
>
> static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> @@ -651,8 +648,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> case RDT_RESOURCE_L3:
> d = container_of(hdr, struct rdt_mon_domain, hdr);
> hw_dom = resctrl_to_arch_mon_dom(d);
> - resctrl_offline_mon_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> + resctrl_offline_mon_domain(r, hdr);
> + list_del_rcu(&hdr->list);
> synchronize_rcu();
> l3_mon_domain_free(hw_dom);
> break;
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 828c743ec470..0213fb3a1113 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3022,7 +3022,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> * when last domain being summed is removed.
> */
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct rdtgroup *prgrp, *crgrp;
> char subname[32];
> @@ -3030,9 +3030,17 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> char name[32];
>
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> - if (snc_mode)
> - sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + if (snc_mode) {
> + struct rdt_mon_domain *d;
> +
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> + sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
> + } else {
> + sprintf(name, "mon_%s_%02d", r->name, hdr->id);
> + }
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> @@ -3042,11 +3050,12 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> }
> }
>
> -static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp,
> bool do_sum)
> {
> struct rmid_read rr = {0};
> + struct rdt_mon_domain *d;
> struct mon_data *priv;
> struct mon_evt *mevt;
> int ret, domid;
> @@ -3054,7 +3063,14 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
> if (mevt->rid != r->rid || !mevt->enabled)
> continue;
> - domid = do_sum ? d->ci->id : d->hdr.id;
> + if (r->rid == RDT_RESOURCE_L3) {
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return -EINVAL;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + domid = do_sum ? d->ci->id : d->hdr.id;
> + } else {
> + domid = hdr->id;
> + }
> priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
> if (WARN_ON_ONCE(!priv))
> return -EINVAL;
> @@ -3063,18 +3079,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> if (ret)
> return ret;
>
> - if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
> - mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
> + if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
> + mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt->evtid, true);
> }
>
> return 0;
> }
>
> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> - struct rdt_mon_domain *d,
> + struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> struct kernfs_node *kn, *ckn;
> + struct rdt_mon_domain *d;
> char name[32];
> bool snc_mode;
> int ret = 0;
> @@ -3082,7 +3099,14 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> lockdep_assert_held(&rdtgroup_mutex);
>
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> + if (snc_mode) {
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return -EINVAL;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> + } else {
> + sprintf(name, "mon_%s_%02d", r->name, hdr->id);
> + }
> kn = kernfs_find_and_get(parent_kn, name);
> if (kn) {
> /*
> @@ -3098,13 +3122,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> ret = rdtgroup_kn_set_ugid(kn);
> if (ret)
> goto out_destroy;
> - ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
> + ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
> if (ret)
> goto out_destroy;
> }
>
> if (snc_mode) {
> - sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
> ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> if (IS_ERR(ckn)) {
> ret = -EINVAL;
> @@ -3115,7 +3139,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> if (ret)
> goto out_destroy;
>
> - ret = mon_add_all_files(ckn, d, r, prgrp, false);
> + ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
> if (ret)
> goto out_destroy;
> }
> @@ -3133,7 +3157,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> * and "monitor" groups with given domain id.
> */
> static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct kernfs_node *parent_kn;
> struct rdtgroup *prgrp, *crgrp;
> @@ -3141,12 +3165,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> parent_kn = prgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
>
> head = &prgrp->mon.crdtgrp_list;
> list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
> parent_kn = crgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, crgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
> }
> }
> }
> @@ -3155,14 +3179,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_resource *r,
> struct rdtgroup *prgrp)
> {
> - struct rdt_mon_domain *dom;
> + struct rdt_domain_hdr *hdr;
> int ret;
>
> /* Walking r->domains, ensure it can't race with cpuhp */
> lockdep_assert_cpus_held();
>
> - list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> - ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> + list_for_each_entry(hdr, &r->mon_domains, list) {
> + ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
> if (ret)
> return ret;
> }
> @@ -4030,8 +4054,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> + struct rdt_mon_domain *d;
> +
> mutex_lock(&rdtgroup_mutex);
>
> /*
> @@ -4039,11 +4065,15 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> * per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - rmdir_mondata_subdir_allrdtgrp(r, d);
> + rmdir_mondata_subdir_allrdtgrp(r, hdr);
>
> if (r->rid != RDT_RESOURCE_L3)
> goto done;
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> if (resctrl_is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
> @@ -4126,12 +4156,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
> return err;
> }
>
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> - int err;
> + struct rdt_mon_domain *d;
> + int err = -EINVAL;
>
> mutex_lock(&rdtgroup_mutex);
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + goto out_unlock;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> err = domain_setup_l3_mon_state(r, d);
> if (err)
> goto out_unlock;
> @@ -4152,7 +4187,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> * If resctrl is mounted, add per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - mkdir_mondata_subdir_allrdtgrp(r, d);
> + mkdir_mondata_subdir_allrdtgrp(r, hdr);
>
> out_unlock:
> mutex_unlock(&rdtgroup_mutex);
^ permalink raw reply [flat|nested] 90+ messages in thread
* RE: [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr
2025-05-22 0:01 ` Keshavamurthy, Anil S
@ 2025-05-22 0:15 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-05-22 0:15 UTC (permalink / raw)
To: Keshavamurthy, Anil S, Fenghua Yu, Chatre, Reinette,
Wieczor-Retman, Maciej, Peter Newman, James Morse, Babu Moger,
Drew Fustini, Dave Martin, Chen, Yu C
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
> > @@ -597,25 +595,24 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> > if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> > return;
> >
> > + cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> Looks like variable 'd' is uninitialized when used here. Can you please
> check?
Good catch. I missed switching that to:
cpumask_clear_cpu(cpu, &hdr->cpu_mask);
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events
2025-05-21 22:50 ` [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events Tony Luck
@ 2025-05-23 9:00 ` Peter Newman
2025-05-23 15:57 ` Luck, Tony
2025-06-04 3:29 ` Reinette Chatre
2025-06-07 0:45 ` Fenghua Yu
2 siblings, 1 reply; 90+ messages in thread
From: Peter Newman @ 2025-05-23 9:00 UTC (permalink / raw)
To: Tony Luck
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On Thu, May 22, 2025 at 12:51 AM Tony Luck <tony.luck@intel.com> wrote:
>
> There's a rule in computer programming that objects appear zero,
> once, or many times. So code accordingly.
>
> There are two MBM events and resctrl is coded with a lot of
>
> if (local)
> do one thing
> if (total)
> do a different thing
>
> Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold arrays
> of pointers to per event data instead of explicit fields for total and
> local bandwidth.
>
> Simplify by coding for many events using loops over those that are enabled.
>
> Move resctrl_is_mbm_event() to <linux/resctrl.h> so it can be used more
> widely. Also provide a for_each_mbm_event() helper macro.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 15 +++++---
> include/linux/resctrl_types.h | 3 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 6 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 38 ++++++++++----------
> arch/x86/kernel/cpu/resctrl/monitor.c | 36 +++++++++----------
> fs/resctrl/monitor.c | 13 ++++---
> fs/resctrl/rdtgroup.c | 48 ++++++++++++--------------
> 7 files changed, 82 insertions(+), 77 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 843ad7c8e247..40f2d0d48d02 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -161,8 +161,7 @@ struct rdt_ctrl_domain {
> * @hdr: common header for different domain types
> * @ci: cache info for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> - * @mbm_total: saved state for MBM total bandwidth
> - * @mbm_local: saved state for MBM local bandwidth
> + * @mbm_states: saved state for each QOS MBM event
> * @mbm_over: worker to periodically read MBM h/w counters
> * @cqm_limbo: worker to periodically read CQM h/w counters
> * @mbm_work_cpu: worker CPU for MBM h/w counters
> @@ -172,8 +171,7 @@ struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> struct cacheinfo *ci;
> unsigned long *rmid_busy_llc;
> - struct mbm_state *mbm_total;
> - struct mbm_state *mbm_local;
> + struct mbm_state *mbm_states[QOS_NUM_L3_MBM_EVENTS];
> struct delayed_work mbm_over;
> struct delayed_work cqm_limbo;
> int mbm_work_cpu;
> @@ -376,6 +374,15 @@ bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
>
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
>
> +static inline bool resctrl_is_mbm_event(enum resctrl_event_id e)
> +{
> + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> + e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +#define for_each_mbm_event(evt) \
> + for (evt = QOS_L3_MBM_TOTAL_EVENT_ID; evt <= QOS_L3_MBM_LOCAL_EVENT_ID; evt++)
> +
> /**
> * resctrl_arch_mon_event_config_write() - Write the config for an event.
> * @config_info: struct resctrl_mon_config_info describing the resource, domain
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index a25fb9c4070d..b468bfbab9ea 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -47,4 +47,7 @@ enum resctrl_event_id {
> QOS_NUM_EVENTS,
> };
>
> +#define QOS_NUM_L3_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> +#define MBM_STATE_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
> +
> #endif /* __LINUX_RESCTRL_TYPES_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 5e3c41b36437..ea185b4d0d59 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -54,15 +54,13 @@ struct rdt_hw_ctrl_domain {
> * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> * a resource for a monitor function
> * @d_resctrl: Properties exposed to the resctrl file system
> - * @arch_mbm_total: arch private state for MBM total bandwidth
> - * @arch_mbm_local: arch private state for MBM local bandwidth
> + * @arch_mbm_states: arch private state for each MBM event
> *
> * Members of this structure are accessed via helpers that provide abstraction.
> */
> struct rdt_hw_mon_domain {
> struct rdt_mon_domain d_resctrl;
> - struct arch_mbm_state *arch_mbm_total;
> - struct arch_mbm_state *arch_mbm_local;
> + struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
> };
>
> static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 819bc7a09327..4403a820db12 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -364,8 +364,8 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
>
> static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
> {
> - kfree(hw_dom->arch_mbm_total);
> - kfree(hw_dom->arch_mbm_local);
> + for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
> + kfree(hw_dom->arch_mbm_states[i]);
> kfree(hw_dom);
> }
>
> @@ -399,25 +399,27 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
> */
> static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
> {
> - size_t tsize;
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> - tsize = sizeof(*hw_dom->arch_mbm_total);
> - hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
> - if (!hw_dom->arch_mbm_total)
> - return -ENOMEM;
> - }
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
> - tsize = sizeof(*hw_dom->arch_mbm_local);
> - hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
> - if (!hw_dom->arch_mbm_local) {
> - kfree(hw_dom->arch_mbm_total);
> - hw_dom->arch_mbm_total = NULL;
> - return -ENOMEM;
> - }
> + size_t tsize = sizeof(struct arch_mbm_state);
sizeof(*hw_dom->arch_mbm_states[0])?
The previous code didn't assume a type.
> + enum resctrl_event_id evt;
> + int idx;
> +
> + for_each_mbm_event(evt) {
> + if (!resctrl_is_mon_event_enabled(evt))
> + continue;
> + idx = MBM_STATE_IDX(evt);
> + hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
> + if (!hw_dom->arch_mbm_states[idx])
> + goto cleanup;
> }
>
> return 0;
> +cleanup:
> + while (--idx >= 0) {
> + kfree(hw_dom->arch_mbm_states[idx]);
> + hw_dom->arch_mbm_states[idx] = NULL;
> + }
> +
> + return -ENOMEM;
> }
>
> static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index fda579251dba..85526e5540f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,18 +160,14 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
> u32 rmid,
> enum resctrl_event_id eventid)
> {
> - switch (eventid) {
> - case QOS_L3_OCCUP_EVENT_ID:
> - return NULL;
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &hw_dom->arch_mbm_total[rmid];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &hw_dom->arch_mbm_local[rmid];
> - default:
> - /* Never expect to get here */
> - WARN_ON_ONCE(1);
> + struct arch_mbm_state *state;
> +
> + if (!resctrl_is_mbm_event(eventid))
> return NULL;
> - }
> +
> + state = hw_dom->arch_mbm_states[MBM_STATE_IDX(eventid)];
> +
> + return state ? &state[rmid] : NULL;
> }
>
> void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> @@ -200,14 +196,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_total, 0,
> - sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_local, 0,
> - sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
> + enum resctrl_event_id evt;
> + int idx;
> +
> + for_each_mbm_event(evt) {
> + idx = MBM_STATE_IDX(evt);
> + if (!hw_dom->arch_mbm_states[idx])
> + continue;
> + memset(hw_dom->arch_mbm_states[idx], 0,
> + sizeof(struct arch_mbm_state) * r->num_rmid);
> + }
> }
>
> static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 325e23c1a403..4cd0789998bf 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -346,15 +346,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> u32 rmid, enum resctrl_event_id evtid)
> {
> u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> + struct mbm_state *states;
>
> - switch (evtid) {
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &d->mbm_total[idx];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &d->mbm_local[idx];
> - default:
> + if (!resctrl_is_mbm_event(evtid))
> return NULL;
> - }
> +
> + states = d->mbm_states[MBM_STATE_IDX(evtid)];
> +
> + return states ? &states[idx] : NULL;
> }
>
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 80e74940281a..8649b89d7bfd 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -127,12 +127,6 @@ static bool resctrl_is_mbm_enabled(void)
> resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID));
> }
>
> -static bool resctrl_is_mbm_event(int e)
> -{
> - return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> - e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> -}
> -
> /*
> * Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap
> * of free CLOSIDs.
> @@ -4020,8 +4014,10 @@ static void rdtgroup_setup_default(void)
> static void domain_destroy_mon_state(struct rdt_mon_domain *d)
> {
> bitmap_free(d->rmid_busy_llc);
> - kfree(d->mbm_total);
> - kfree(d->mbm_local);
> + for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++) {
> + kfree(d->mbm_states[i]);
> + d->mbm_states[i] = NULL;
> + }
> }
>
> void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> @@ -4081,32 +4077,34 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> - size_t tsize;
> + size_t tsize = sizeof(struct mbm_state);
Here too.
Thanks!
-Peter
^ permalink raw reply [flat|nested] 90+ messages in thread
* RE: [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events
2025-05-23 9:00 ` Peter Newman
@ 2025-05-23 15:57 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-05-23 15:57 UTC (permalink / raw)
To: Peter Newman
Cc: Fenghua Yu, Chatre, Reinette, Wieczor-Retman, Maciej, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Keshavamurthy, Anil S,
Chen, Yu C, x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
> > static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
> > {
> > - size_t tsize;
> > -
> > - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> > - tsize = sizeof(*hw_dom->arch_mbm_total);
> > - hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
> > - if (!hw_dom->arch_mbm_total)
> > - return -ENOMEM;
> > - }
> > - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
> > - tsize = sizeof(*hw_dom->arch_mbm_local);
> > - hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
> > - if (!hw_dom->arch_mbm_local) {
> > - kfree(hw_dom->arch_mbm_total);
> > - hw_dom->arch_mbm_total = NULL;
> > - return -ENOMEM;
> > - }
> > + size_t tsize = sizeof(struct arch_mbm_state);
>
> sizeof(*hw_dom->arch_mbm_states[0])?
>
> The previous code didn't assume a type.
>
> > static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
> > {
> > u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> > - size_t tsize;
> > + size_t tsize = sizeof(struct mbm_state);
>
> Here too.
Peter,
Thanks for looking. I will fix both places for next version.
-Tony
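(For reference, a minimal sketch of the allocation helper with the suggestion
applied; this is not the next-version patch, only the tsize initializer changes
while the loop and cleanup stay as posted. domain_setup_mon_state() would get
the same treatment using sizeof(*d->mbm_states[0]).)

static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
{
	/* Derive the element size from the array itself, no type name repeated. */
	size_t tsize = sizeof(*hw_dom->arch_mbm_states[0]);
	enum resctrl_event_id evt;
	int idx;

	for_each_mbm_event(evt) {
		if (!resctrl_is_mon_event_enabled(evt))
			continue;
		idx = MBM_STATE_IDX(evt);
		hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
		if (!hw_dom->arch_mbm_states[idx])
			goto cleanup;
	}

	return 0;
cleanup:
	while (--idx >= 0) {
		kfree(hw_dom->arch_mbm_states[idx]);
		hw_dom->arch_mbm_states[idx] = NULL;
	}

	return -ENOMEM;
}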
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 05/29] x86/rectrl: Fake OOBMSM interface
2025-05-21 22:50 ` [PATCH v5 05/29] x86/rectrl: Fake OOBMSM interface Tony Luck
@ 2025-05-23 23:38 ` Reinette Chatre
2025-05-27 20:25 ` [PATCH v5 05/29 UPDATED] x86/resctrl: " Tony Luck
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-05-23 23:38 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
shortlog: x86/rectrl -> x86/resctrl
How many reports of a typo does it take for a typo to be fixed?
V2: https://lore.kernel.org/all/b69bee17-6a84-4cb2-ab8a-2793c2fe7c49@intel.com/
V3: https://lore.kernel.org/lkml/2897fc2a-8977-4415-ae6d-bd0002874b3a@intel.com/
On 5/21/25 3:50 PM, Tony Luck wrote:
> +
> +/*
> + * Amount of memory for each fake MMIO space
> + * Magic numbers here match values for XML ID 0x26696143 and 0x26557651
> + * 576: Number of RMIDs
> + * 2: Energy events in 0x26557651
> + * 7: Perf events in 0x26696143
> + * 3: Qwords for status counters after the event counters
> + * 8: Bytes for each counter
> + */
> +
> +#define ENERGY_QWORDS ((576 * 2) + 3)
> +#define ENERGY_SIZE (ENERGY_QWORDS * 8)
> +#define PERF_QWORDS ((576 * 7) + 3)
> +#define PERF_SIZE (PERF_QWORDS * 8)
> +
First time asking why energy and perf are both using 576 RMIDs:
V3: https://lore.kernel.org/lkml/2897fc2a-8977-4415-ae6d-bd0002874b3a@intel.com/
Reminded in V4 that V3's question has not been answered:
https://lore.kernel.org/lkml/7fa19421-9093-411b-b8e2-da56156a9971@intel.com/
Question is still not answered in this version (neither was it answered in
response to the emails where I asked the questions).
> +
> +/*
> + * Set up a fake return for call to:
> + * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
> + * Pretend there are two aggregators on each of the sockets to test
> + * the code that sums over multiple aggregators.
> + * Pretend this group only supports 64 RMIDs to exercise the code
> + * that reconciles support for different RMID counts.
> + */
This version adds this comment. How does it answer the original question?
The comment highlights that energy uses 64 RMIDs and thus highlights that
it is unexpected that the defines above use 576 for energy RMIDs.
The reader is left to decipher the code from multiple later patches to try to
make sense of this without it ever being explained.
Repeating a question and reporting a typo three times makes for an
unproductive and frustrating review. I usually work through a whole series
before I post feedback but when I got to this patch I did not feel like going
further. Why should I bother?
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* [PATCH v5 05/29 UPDATED] x86/resctrl: Fake OOBMSM interface
2025-05-23 23:38 ` Reinette Chatre
@ 2025-05-27 20:25 ` Tony Luck
0 siblings, 0 replies; 90+ messages in thread
From: Tony Luck @ 2025-05-27 20:25 UTC (permalink / raw)
To: reinette.chatre
Cc: Dave.Martin, anil.s.keshavamurthy, babu.moger, dfustini, fenghuay,
james.morse, linux-kernel, maciej.wieczor-retman, patches,
peternewman, tony.luck, x86, yu.c.chen
=== Changes since original v5 version of this patch ===
1) Fix typo in Subject s/rectrl/resctrl/
2) Explain all constants used for fake pmt_feature_group structures
3) Explain difference between the number of RMIDs described by the
XML files, and the number provided in telemetry_region::num_rmids.
4) Provide separate u64 arrays for each of the fake MMIO space regions
and initialize so that event registers beyond the limit of number
of hardware counters read as zero (i.e. without the DATA_VALID bit
set).
===
Real version is coming soon[1]. This is here so the remaining parts
will build (and run: assuming a 2 socket system that supports RDT
monitoring).
Missing parts:
1) The event counters just report fixed values.
2) No emulation of most-recently-used for aggregators that have fewer
hardware counters than RMID registers in MMIO space.
Faked values are provided to exercise some special conditions:
1) Multiple counter aggregators for an event per-socket.
2) Different number of hardware counters backing the RMIDs for each group.
Just for ease of testing and RFC discussion.
[1]
Link: https://lore.kernel.org/all/20250430212106.369208-1-david.e.box@linux.intel.com/
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
.../cpu/resctrl/fake_intel_aet_features.h | 73 ++++++++
.../cpu/resctrl/fake_intel_aet_features.c | 158 ++++++++++++++++++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
3 files changed, 232 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
new file mode 100644
index 000000000000..c835c4108abc
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Bits stolen from OOBMSM VSEC discovery code */
+
+enum pmt_feature_id {
+ FEATURE_INVALID = 0x0,
+ FEATURE_PER_CORE_PERF_TELEM = 0x1,
+ FEATURE_PER_CORE_ENV_TELEM = 0x2,
+ FEATURE_PER_RMID_PERF_TELEM = 0x3,
+ FEATURE_ACCEL_TELEM = 0x4,
+ FEATURE_UNCORE_TELEM = 0x5,
+ FEATURE_CRASH_LOG = 0x6,
+ FEATURE_PETE_LOG = 0x7,
+ FEATURE_TPMI_CTRL = 0x8,
+ FEATURE_RESERVED = 0x9,
+ FEATURE_TRACING = 0xA,
+ FEATURE_PER_RMID_ENERGY_TELEM = 0xB,
+ FEATURE_MAX = 0xB,
+};
+
+/**
+ * struct oobmsm_plat_info - Platform information for a device instance
+ * @cdie_mask: Mask of all compute dies in the partition
+ * @package_id: CPU Package id
+ * @partition: Package partition id when multiple VSEC PCI devices per package
+ * @segment: PCI segment ID
+ * @bus_number: PCI bus number
+ * @device_number: PCI device number
+ * @function_number: PCI function number
+ *
+ * Structure to store platform data for a OOBMSM device instance.
+ */
+struct oobmsm_plat_info {
+ u16 cdie_mask;
+ u8 package_id;
+ u8 partition;
+ u8 segment;
+ u8 bus_number;
+ u8 device_number;
+ u8 function_number;
+};
+
+enum oobmsm_supplier_type {
+ OOBMSM_SUP_PLAT_INFO,
+ OOBMSM_SUP_DISC_INFO,
+ OOBMSM_SUP_S3M_SIMICS,
+ OOBMSM_SUP_TYPE_MAX
+};
+
+struct oobmsm_mapping_supplier {
+ struct device *supplier_dev[OOBMSM_SUP_TYPE_MAX];
+ struct oobmsm_plat_info plat_info;
+ unsigned long features;
+};
+
+struct telemetry_region {
+ struct oobmsm_plat_info plat_info;
+ void __iomem *addr;
+ size_t size;
+ u32 guid;
+ u32 num_rmids;
+};
+
+struct pmt_feature_group {
+ enum pmt_feature_id id;
+ int count;
+ struct kref kref;
+ struct telemetry_region regions[];
+};
+
+struct pmt_feature_group *intel_pmt_get_regions_by_feature(enum pmt_feature_id id);
+
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group);
diff --git a/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
new file mode 100644
index 000000000000..b2f1afee063d
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/cleanup.h>
+#include <linux/minmax.h>
+#include <linux/slab.h>
+#include "fake_intel_aet_features.h"
+#include <linux/intel_vsec.h>
+#include <linux/resctrl.h>
+
+#include "internal.h"
+
+/*
+ * The following constants taken from Intel-PMT github repository at
+ * Link: https://github.com/intel/Intel-PMT
+ */
+
+/*
+ * XML file at xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
+ */
+#define ENERGY_GUID 0x26696143 /* Listed as "<TELI:uniqueid>" in XML */
+#define ENERGY_RMIDS 576 /* Register definitions run from 0 to 575 */
+#define ENERGY_NUM_EVENTS 2 /* CORE_ENERGY .. ACTIVITY */
+#define ENERGY_STATUS_REGS 3 /* Number of status registers at end of MMIO */
+
+/*
+ * XML file at xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
+ */
+#define PERF_GUID 0x26557651 /* Listed as "<TELI:uniqueid>" in XML */
+#define PERF_RMIDS 576 /* Register definitions run from 0 to 575 */
+#define PERF_NUM_EVENTS 7 /* STALLS_LLC_HIT .. UOPS_RETIRED_VALID */
+#define PERF_STATUS_REGS 3 /* Number of status registers at end of MMIO */
+
+/*
+ * Size of MMIO space in each telemetry aggregator for energy events.
+ */
+#define ENERGY_QWORDS ((ENERGY_RMIDS * ENERGY_NUM_EVENTS) + ENERGY_STATUS_REGS)
+#define ENERGY_SIZE (ENERGY_QWORDS * sizeof(u64))
+
+/*
+ * Size of MMIO space in each telemetry aggregator for perf events.
+ */
+#define PERF_QWORDS ((PERF_RMIDS * PERF_NUM_EVENTS) + PERF_STATUS_REGS)
+#define PERF_SIZE (PERF_QWORDS * sizeof(u64))
+
+/*
+ * These next numbers are made up out of thin air. Chosen to exercise
+ * various configurations (some present in the first implementation
+ * of telemetry events, others will appear in later implementations).
+ */
+#define ENERGY_TELEM_AGGREGATORS_PER_SOCKET 2
+#define PERF_TELEM_AGGREGATORS_PER_SOCKET 1
+#define NUM_SOCKETS 2
+
+/*
+ * The number of implemented hardware counters for a telemetry
+ * aggregator may be smaller than the number of MMIO registers
+ * allocated. When this happens the hardware uses a most recently
+ * used algorithm to attach counters to MMIO registers.
+ * MMIO registers that are not backed by counters read with
+ * BIT(63) as zero. This fake code does not attempt to
+ * fully emulate the MRU algorithm. But it does provide return
+ * from intel_pmt_get_regions_by_feature() that indicates
+ * fewer hardware counters.
+ */
+#define ENERGY_NUM_HARDWARE_COUNTERS 64
+#define PERF_NUM_HARDWARE_COUNTERS 576
+
+
+/* Fake MMIO space of each fake energy aggregator */
+static u64 fake_energy_0[ENERGY_QWORDS];
+static u64 fake_energy_1[ENERGY_QWORDS];
+static u64 fake_energy_2[ENERGY_QWORDS];
+static u64 fake_energy_3[ENERGY_QWORDS];
+
+/* Fake MMIO space of each fake perf aggregator */
+static u64 fake_perf_0[PERF_QWORDS];
+static u64 fake_perf_1[PERF_QWORDS];
+
+/*
+ * Fill each fake MMIO space with distinct values,
+ * all with BIT(63) set to indicate valid entries.
+ */
+static int __init fill(void)
+{
+ u64 val = 0;
+ int i;
+
+ for (i = 0; i < ENERGY_NUM_HARDWARE_COUNTERS * ENERGY_NUM_EVENTS; i++)
+ fake_energy_0[i] = BIT_ULL(63) + val++;
+ for (i = 0; i < ENERGY_NUM_HARDWARE_COUNTERS * ENERGY_NUM_EVENTS; i++)
+ fake_energy_1[i] = BIT_ULL(63) + val++;
+ for (i = 0; i < ENERGY_NUM_HARDWARE_COUNTERS * ENERGY_NUM_EVENTS; i++)
+ fake_energy_2[i] = BIT_ULL(63) + val++;
+ for (i = 0; i < ENERGY_NUM_HARDWARE_COUNTERS * ENERGY_NUM_EVENTS; i++)
+ fake_energy_3[i] = BIT_ULL(63) + val++;
+
+ for (i = 0; i < PERF_QWORDS; i++)
+ fake_perf_0[i] = BIT_ULL(63) + val++;
+ for (i = 0; i < PERF_QWORDS; i++)
+ fake_perf_1[i] = BIT_ULL(63) + val++;
+
+ return 0;
+}
+device_initcall(fill);
+
+#define PKG_REGION(_entry, _guid, _addr, _size, _pkg, _num_rmids) \
+ [_entry] = { .guid = _guid, .addr = (void __iomem *)_addr, \
+ .num_rmids = _num_rmids, \
+ .size = _size, .plat_info = { .package_id = _pkg }}
+
+/*
+ * Set up a fake return for call to:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
+ * Pretend there are two aggregators on each of the sockets to test
+ * the code that sums over multiple aggregators.
+ * Pretend this group only supports 64 RMIDs to exercise the code
+ * that reconciles support for different RMID counts.
+ */
+static struct pmt_feature_group fake_energy = {
+ .count = NUM_SOCKETS * ENERGY_TELEM_AGGREGATORS_PER_SOCKET,
+ .regions = {
+ PKG_REGION(0, ENERGY_GUID, fake_energy_0, ENERGY_SIZE, 0, ENERGY_NUM_HARDWARE_COUNTERS),
+ PKG_REGION(1, ENERGY_GUID, fake_energy_1, ENERGY_SIZE, 0, ENERGY_NUM_HARDWARE_COUNTERS),
+ PKG_REGION(2, ENERGY_GUID, fake_energy_2, ENERGY_SIZE, 1, ENERGY_NUM_HARDWARE_COUNTERS),
+ PKG_REGION(3, ENERGY_GUID, fake_energy_3, ENERGY_SIZE, 1, ENERGY_NUM_HARDWARE_COUNTERS)
+ }
+};
+
+/*
+ * Fake return for:
+ * intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_PERF_TELEM);
+ */
+static struct pmt_feature_group fake_perf = {
+ .count = NUM_SOCKETS * PERF_TELEM_AGGREGATORS_PER_SOCKET,
+ .regions = {
+ PKG_REGION(0, PERF_GUID, fake_perf_0, PERF_SIZE, 0, PERF_NUM_HARDWARE_COUNTERS),
+ PKG_REGION(1, PERF_GUID, fake_perf_1, PERF_SIZE, 1, PERF_NUM_HARDWARE_COUNTERS)
+ }
+};
+
+struct pmt_feature_group *
+intel_pmt_get_regions_by_feature(enum pmt_feature_id id)
+{
+ switch (id) {
+ case FEATURE_PER_RMID_ENERGY_TELEM:
+ return &fake_energy;
+ case FEATURE_PER_RMID_PERF_TELEM:
+ return &fake_perf;
+ default:
+ return ERR_PTR(-ENOENT);
+ }
+}
+
+/*
+ * Nothing needed for the "put" function.
+ */
+void intel_pmt_put_feature_group(struct pmt_feature_group *feature_group)
+{
+}
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index d8a04b195da2..cf4fac58d068 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_X86_CPU_RESCTRL) += fake_intel_aet_features.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
# To allow define_trace.h's recursive include:
--
2.49.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (28 preceding siblings ...)
2025-05-21 22:50 ` [PATCH v5 29/29] x86/resctrl: Update Documentation for package events Tony Luck
@ 2025-05-28 17:21 ` Reinette Chatre
2025-05-28 21:38 ` Luck, Tony
2025-06-13 16:57 ` James Morse
30 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-05-28 17:21 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Background
> ----------
>
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be
(could you please expand what "OOBMSM" means somewhere?)
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.
Could you please use consistent terminology? The short paragraph above
uses "logical CPU", "CPU", and "core" seemingly interchangeably and that is
confusing since these terms mean different things on x86 (re.
Documentation/arch/x86/topology.rst). (more below)
>
> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package. In the initial
("logical CPU" ... but seems to be used in same context as "core" in
first paragraph?)
> hardware implementation, there are two categories of events: energy
> and perf.
>
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.
With RMIDs being per logical CPU it is not obvious to me how these
events should be treated since most of them are described as core events.
If the RMID is per logical CPU and the events are per core then how should
the counters be interpreted? Would the user, for example, need to set CPU
affinity to ensure only tasks within same monitor group are run on the same
cores? How else can it be ensured the data reflects the monitor group it
is reported for?
> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).
(No scope for this event.)
>
> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with much lower overhead (no need to collect data
> on every context switch).
>
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC
(core scope)
>
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks
("across all cores" ... package scope?)
>
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles
(core)
>
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches
(core)
>
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
>
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
>
> uops_retired - Counts the total number of uops retired
(no scope in descriptions of the above)
>
> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
>
> Offset: Counter
> 0x00 core energy for RMID 0
> 0x08 core activity for RMID 0
> 0x10 core energy for RMID 1
> 0x18 core activity for RMID 1
> ...
Does seem to hint that counters/events are always per core (but the descriptions
do not always reflect that) while RMID is per logical-CPU.
Reinette
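(For reference, a minimal sketch of how the interleaved layout quoted above maps
an <RMID, event> pair to a counter; the helper name and signature are made up
purely for illustration and do not appear in the series.)

/* Byte offset of one counter within an aggregator's MMIO region. */
static inline size_t counter_offset(u32 rmid, u32 event_index, u32 num_events)
{
	return (rmid * num_events + event_index) * sizeof(u64);
}

/*
 * The energy group has two events (core energy = 0, activity = 1), so
 * counter_offset(0, 1, 2) == 0x08 and counter_offset(1, 0, 2) == 0x10,
 * matching the table above.
 */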
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
2025-05-28 17:21 ` [PATCH v5 00/29] x86/resctrl telemetry monitoring Reinette Chatre
@ 2025-05-28 21:38 ` Luck, Tony
2025-05-28 22:21 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-05-28 21:38 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Reinette,
I've begun drafting a new cover letter to explain telemetry.
Here's the introduction. Let me know if it helps cover the
gaps and ambiguities that you pointed out.
-Tony
RMID based telemetry events
---------------------------
Each CPU on a system keeps a local count of various events.
Every two milliseconds, or when the value of the RMID field in the
IA32_PQR_ASSOC MSR is changed, the CPU transmits all the event counts
together with the value of the RMID to a nearby OOBMSM (Out of band
management services module) device. The CPU then resets all counters and
begins counting events for the new RMID or time interval.
The OOBMSM device sums each event count with those received from other
CPUs keeping a running total for each event for each RMID.
The operating system can read these counts to gather a picture of
system-wide activity for each of the logged events per-RMID.
E.g. the operating system may assign RMID 5 to all the tasks running to
perform a certain job. When it reads the core energy event counter for
RMID 5 it will see the total energy consumed by CPU cores for all tasks
in that job while running on any CPU. This is a much lower overhead
mechanism to track events per job than the typical "perf" approach
of reading counters on every context switch.
Events
------
"core energy" The number of Joules consumed by CPU cores during execution
of instructions for the current RMID.
Note that this does not include energy used by the "uncore" (LLC cache
and interfaces to off package devices) or energy used by memory or I/O
devices. Energy may be calculated based on measures of activity rather
than the output from a power meter.
"activity" The dynamic capacitance (Cdyn) in Farads for a core due to
execution of instructions for the current RMID. This event will be
more useful to a user interested in optimizing energy consumption
of a workload because it is invariant of frequency changes (e.g.
turbo mode) that may be outside of the control of the developer.
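(Background note, assuming only the usual CMOS dynamic-power model rather than
anything stated in this series: dynamic energy per switching cycle is roughly
Cdyn * V^2, so a counter that accumulates Cdyn alone factors out both voltage
and frequency. That is why "activity" is described as invariant to turbo and
voltage changes while "core energy" is not.)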
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
2025-05-28 21:38 ` Luck, Tony
@ 2025-05-28 22:21 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-05-28 22:21 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 5/28/25 2:38 PM, Luck, Tony wrote:
> Hi Reinette,
>
> I've begun drafting a new cover letter to explain telemetry.
>
> Here's the introduction. Let me know if it helps cover the
> gaps and ambiguities that you pointed out.
>
> -Tony
>
>
> RMID based telemetry events
> ---------------------------
>
> Each CPU on a system keeps a local count of various events.
>
> Every two milliseconds, or when the value of the RMID field in the
> IA32_PQR_ASSOC MSR is changed, the CPU transmits all the event counts
> together with the value of the RMID to a nearby OOBMSM (Out of band
> management services module) device. The CPU then resets all counters and
> begins counting events for the new RMID or time interval.
>
> The OOBMSM device sums each event count with those received from other
> CPUs keeping a running total for each event for each RMID.
>
> The operating system can read these counts to gather a picture of
> system-wide activity for each of the logged events per-RMID.
>
> E.g. the operating system may assign RMID 5 to all the tasks running to
> perform a certain job. When it reads the core energy event counter for
> RMID 5 it will see the total energy consumed by CPU cores for all tasks
> in that job while running on any CPU. This is a much lower overhead
> mechanism to track events per job than the typical "perf" approach
> of reading counters on every context switch.
>
Could you please elaborate on the CPU vs core distinction?
If the example above is for a system with below topology (copied from
Documentation/arch/x86/topology.rst):
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-> [thread 1] -> Linux CPU 1
-> [core 1] -> [thread 0] -> Linux CPU 2
-> [thread 1] -> Linux CPU 3
In the example, RMID 5 is assigned to tasks running "a certain job", for
convenience I will name it "jobA". Consider if the example is extended
with RMID 6 assigned to tasks running another job, "jobB".
If a jobA task is scheduled on CPU 0 and a jobB task is scheduled in CPU 1
then it may look like:
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 #RMID 5
-> [thread 1] -> Linux CPU 1 #RMID 6
-> [core 1] -> [thread 0] -> Linux CPU 2
-> [thread 1] -> Linux CPU 3
The example above states:
When it reads the core energy event counter for RMID 5 it will
see the total energy consumed by CPU cores for all tasks in that
job while running on any CPU.
With RMID 5 and RMID 6 both running on core 0, and "RMID 5 will see
the total energy consumed by CPU cores", does this mean that reading RMID 5
counter will return the energy consumed by core 0 while RMID 5 is assigned to
CPU 0? Since core 0 contains both CPU 0 and CPU 1, would reading RMID 5 thus return
data of both RMID 5 and RMID 6 (jobA and jobB)?
And vice versa, reading RMID 6 will also include energy consumed by tasks
running with RMID 5?
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 01/29] x86,fs/resctrl: Consolidate monitor event descriptions
2025-05-21 22:50 ` [PATCH v5 01/29] x86,fs/resctrl: Consolidate monitor event descriptions Tony Luck
@ 2025-06-04 3:25 ` Reinette Chatre
2025-06-04 16:33 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:25 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> There are currently only three monitor events, all associated with
> the RDT_RESOURCE_L3 resource. Growing support for additional events
> will be easier with some restructuring to have a single point in
> file system code where all attributes of all events are defined.
>
> Place all event descriptions into an array mon_event_all[]. Doing
> this has the beneficial side effect of removing the need for
> rdt_resource::evt_list.
>
> Drop the code that builds evt_list and change the two places where
> the list is scanned to scan mon_event_all[] instead.
>
> Architecture code now informs file system code which events are
> available with resctrl_enable_mon_event().
nit: extra space above
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
...
> @@ -372,6 +370,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
> u32 resctrl_arch_system_num_rmid_idx(void);
> int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
>
> +void resctrl_enable_mon_event(enum resctrl_event_id evtid);
> +
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
nit: Code is easier to read when its naming is consistent.
Above there is already resctrl_arch_is_evt_configurable() that uses "evt"
as its parameter name, so naming the new parameter "evt" instead of "evtid"
will be much easier on the eye and make clear that this is the "same thing".
Also later, when resctrl_is_mbm_event() is moved, it will be cleaner to have
it also use "evt" as the parameter name and not end up with three different
names "evtid", "evt", and "e" across these related functions.
>
> /**
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 9a8cf6f11151..94e635656261 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -52,19 +52,23 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> }
>
> /**
> - * struct mon_evt - Entry in the event list of a resource
> + * struct mon_evt - Description of a monitor event
nit: "Description" -> "Properties"?
> * @evtid: event id
> + * @rid: index of the resource for this event
> * @name: name of the event
> * @configurable: true if the event is configurable
> - * @list: entry in &rdt_resource->evt_list
> + * @enabled: true if the event is enabled
> */
> struct mon_evt {
> enum resctrl_event_id evtid;
> + enum resctrl_res_level rid;
> char *name;
> bool configurable;
> - struct list_head list;
> + bool enabled;
> };
>
...
> -static void l3_mon_evt_init(struct rdt_resource *r)
> +struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
> + [QOS_L3_OCCUP_EVENT_ID] = {
> + .name = "llc_occupancy",
> + .evtid = QOS_L3_OCCUP_EVENT_ID,
> + .rid = RDT_RESOURCE_L3,
> + },
> + [QOS_L3_MBM_TOTAL_EVENT_ID] = {
> + .name = "mbm_total_bytes",
> + .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
> + .rid = RDT_RESOURCE_L3,
> + },
> + [QOS_L3_MBM_LOCAL_EVENT_ID] = {
> + .name = "mbm_local_bytes",
> + .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
> + .rid = RDT_RESOURCE_L3,
> + },
> +};
> +
> +void resctrl_enable_mon_event(enum resctrl_event_id evtid)
> {
> - INIT_LIST_HEAD(&r->evt_list);
> + if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
If the goal is range checking then there should be a lower limit also.
With the event IDs starting at 1 it could be useful to ensure that
the range check considers that. To help with this I think it will be
useful to introduce a new enum value, for example QOS_FIRST_EVENT,
that is the same value as QOS_L3_OCCUP_EVENT_ID, and use it in
range checking (more below).
> + return;
> + if (mon_event_all[evtid].enabled) {
> + pr_warn("Duplicate enable for event %d\n", evtid);
> + return;
> + }
>
> - if (resctrl_arch_is_llc_occupancy_enabled())
> - list_add_tail(&llc_occupancy_event.list, &r->evt_list);
> - if (resctrl_arch_is_mbm_total_enabled())
> - list_add_tail(&mbm_total_event.list, &r->evt_list);
> - if (resctrl_arch_is_mbm_local_enabled())
> - list_add_tail(&mbm_local_event.list, &r->evt_list);
> + mon_event_all[evtid].enabled = true;
> }
>
> /**
> @@ -900,15 +901,13 @@ int resctrl_mon_resource_init(void)
> if (ret)
> return ret;
>
> - l3_mon_evt_init(r);
> -
> if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> - mbm_total_event.configurable = true;
> + mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
> resctrl_file_fflags_init("mbm_total_bytes_config",
> RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
> }
> if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) {
> - mbm_local_event.configurable = true;
> + mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].configurable = true;
> resctrl_file_fflags_init("mbm_local_bytes_config",
> RFTYPE_MON_INFO | RFTYPE_RES_CACHE);
> }
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index cc37f58b47dd..69e0d40c4449 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -1150,7 +1150,9 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
> struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
> struct mon_evt *mevt;
>
> - list_for_each_entry(mevt, &r->evt_list, list) {
> + for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
This looks risky to have the pattern start with mon_event_all[0] when that
array entry is not fully initialized. With a first array entry of all zeroes
a blind copy of this pattern may trip on a false positive with 0 being a
valid resource ID. Here also I think it will help to have a new enum
value of, for example, QOS_FIRST_EVENT, and have the iteration start with it.
This could also be simplified with a helper, for example for_each_mon_evt()
or for_each_mon_event(), that can be used instead of open coding this. If you
do decide to do this it may be useful to rename for_each_mbm_event() to, for example,
for_each_mbm_event_id() to better reflect how that helper is different.
> + if (mevt->rid != r->rid || !mevt->enabled)
> + continue;
> seq_printf(seq, "%s\n", mevt->name);
> if (mevt->configurable)
> seq_printf(seq, "%s_config\n", mevt->name);
> @@ -3055,10 +3057,9 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> struct mon_evt *mevt;
> int ret, domid;
>
> - if (WARN_ON(list_empty(&r->evt_list)))
> - return -EPERM;
> -
> - list_for_each_entry(mevt, &r->evt_list, list) {
> + for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
> + if (mevt->rid != r->rid || !mevt->enabled)
> + continue;
> domid = do_sum ? d->ci->id : d->hdr.id;
> priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
> if (WARN_ON_ONCE(!priv))
Reinette
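(A minimal sketch of the helpers suggested above; the names QOS_FIRST_EVENT and
for_each_mon_event() are only examples, and whether the lower bound becomes a
new enumerator or a define is an open choice.)

/* Alias for the lowest valid event ID (event IDs start at 1). */
#define QOS_FIRST_EVENT		QOS_L3_OCCUP_EVENT_ID

/* Iterate over every event known to the filesystem. */
#define for_each_mon_event(mevt)					\
	for (mevt = &mon_event_all[QOS_FIRST_EVENT];			\
	     mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++)

/* Range checks gain a lower bound, e.g. in resctrl_enable_mon_event(): */
	if (WARN_ON_ONCE(evt < QOS_FIRST_EVENT || evt >= QOS_NUM_EVENTS))
		return;

/* Open-coded loops such as the one in rdt_mon_features_show() then become: */
	for_each_mon_event(mevt) {
		if (mevt->rid != r->rid || !mevt->enabled)
			continue;
		seq_printf(seq, "%s\n", mevt->name);
	}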
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 02/29] x86,fs/resctrl: Replace architecture event enabled checks
2025-05-21 22:50 ` [PATCH v5 02/29] x86,fs/resctrl: Replace architecture event enabled checks Tony Luck
@ 2025-06-04 3:26 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:26 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
>
> @@ -877,6 +877,11 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid)
> mon_event_all[evtid].enabled = true;
> }
>
> +bool resctrl_is_mon_event_enabled(enum resctrl_event_id evtid)
> +{
> + return evtid < QOS_NUM_EVENTS && mon_event_all[evtid].enabled;
> +}
> +
Similar to previous patch the range check here looks to also benefit
from a new "QOS_FIRST_EVENT" or equivalent.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 03/29] x86/resctrl: Remove 'rdt_mon_features' global variable
2025-05-21 22:50 ` [PATCH v5 03/29] x86/resctrl: Remove 'rdt_mon_features' global variable Tony Luck
@ 2025-06-04 3:27 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:27 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> This variable was used as a bitmask of enabled monitor events. But
> that function is now provided by the filesystem mon_event_all[] array.
>
> Remove the remaining uses of this variable.
I do not see any reason why this changelog needs to be so obfuscated. Why go
through the effort of referring to rdt_mon_features as "this variable" throughout
the changelog when it could just be called by name? Same with "that function".
Compare with:
rdt_mon_features is used as a bitmask of enabled monitor events. A monitor
event's status is now maintained in mon_evt::enabled with all monitor
events' mon_evt structures found in the filesystem's mon_event_all[] array.
Remove the remaining uses of rdt_mon_features.
Patch looks good.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events
2025-05-21 22:50 ` [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events Tony Luck
2025-05-23 9:00 ` Peter Newman
@ 2025-06-04 3:29 ` Reinette Chatre
2025-06-07 0:45 ` Fenghua Yu
2 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:29 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> There's a rule in computer programming that objects appear zero,
> once, or many times. So code accordingly.
>
> There are two MBM events and resctrl is coded with a lot of
>
> if (local)
> do one thing
> if (total)
> do a different thing
>
> Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold arrays
> of pointers to per event data instead of explicit fields for total and
> local bandwidth.
>
> Simplify by coding for many events using loops over those that are enabled.
>
> Move resctrl_is_mbm_event() to <linux/resctrl.h> so it can be used more
> widely. Also provide a for_each_mbm_event() helper macro.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 15 +++++---
> include/linux/resctrl_types.h | 3 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 6 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 38 ++++++++++----------
> arch/x86/kernel/cpu/resctrl/monitor.c | 36 +++++++++----------
> fs/resctrl/monitor.c | 13 ++++---
> fs/resctrl/rdtgroup.c | 48 ++++++++++++--------------
> 7 files changed, 82 insertions(+), 77 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 843ad7c8e247..40f2d0d48d02 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -161,8 +161,7 @@ struct rdt_ctrl_domain {
> * @hdr: common header for different domain types
> * @ci: cache info for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> - * @mbm_total: saved state for MBM total bandwidth
> - * @mbm_local: saved state for MBM local bandwidth
> + * @mbm_states: saved state for each QOS MBM event
This can be more useful. For example:
Per-event pointer to the MBM event's saved state. An MBM event's state
is an array of struct mbm_state indexed by RMID on x86 or combined
CLOSID, RMID on Arm.
> * @mbm_over: worker to periodically read MBM h/w counters
> * @cqm_limbo: worker to periodically read CQM h/w counters
> * @mbm_work_cpu: worker CPU for MBM h/w counters
> @@ -172,8 +171,7 @@ struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> struct cacheinfo *ci;
> unsigned long *rmid_busy_llc;
> - struct mbm_state *mbm_total;
> - struct mbm_state *mbm_local;
> + struct mbm_state *mbm_states[QOS_NUM_L3_MBM_EVENTS];
> struct delayed_work mbm_over;
> struct delayed_work cqm_limbo;
> int mbm_work_cpu;
> @@ -376,6 +374,15 @@ bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
>
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
>
> +static inline bool resctrl_is_mbm_event(enum resctrl_event_id e)
nit: e -> evt (re. patch #1 comments)
> +{
> + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> + e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +#define for_each_mbm_event(evt) \
> + for (evt = QOS_L3_MBM_TOTAL_EVENT_ID; evt <= QOS_L3_MBM_LOCAL_EVENT_ID; evt++)
Please refer to comment in patch #1 about possible name change to"for_each_mbm_event_id()"
or similar.
> +
> /**
> * resctrl_arch_mon_event_config_write() - Write the config for an event.
> * @config_info: struct resctrl_mon_config_info describing the resource, domain
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index a25fb9c4070d..b468bfbab9ea 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -47,4 +47,7 @@ enum resctrl_event_id {
> QOS_NUM_EVENTS,
> };
>
> +#define QOS_NUM_L3_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> +#define MBM_STATE_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
> +
> #endif /* __LINUX_RESCTRL_TYPES_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 5e3c41b36437..ea185b4d0d59 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -54,15 +54,13 @@ struct rdt_hw_ctrl_domain {
> * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> * a resource for a monitor function
> * @d_resctrl: Properties exposed to the resctrl file system
> - * @arch_mbm_total: arch private state for MBM total bandwidth
> - * @arch_mbm_local: arch private state for MBM local bandwidth
> + * @arch_mbm_states: arch private state for each MBM event
> *
Can also be made more useful like previous example.
> * Members of this structure are accessed via helpers that provide abstraction.
> */
> struct rdt_hw_mon_domain {
> struct rdt_mon_domain d_resctrl;
> - struct arch_mbm_state *arch_mbm_total;
> - struct arch_mbm_state *arch_mbm_local;
> + struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
> };
>
> static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
...
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index fda579251dba..85526e5540f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,18 +160,14 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
> u32 rmid,
> enum resctrl_event_id eventid)
> {
> - switch (eventid) {
> - case QOS_L3_OCCUP_EVENT_ID:
> - return NULL;
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &hw_dom->arch_mbm_total[rmid];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &hw_dom->arch_mbm_local[rmid];
> - default:
> - /* Never expect to get here */
> - WARN_ON_ONCE(1);
> + struct arch_mbm_state *state;
> +
> + if (!resctrl_is_mbm_event(eventid))
> return NULL;
> - }
> +
> + state = hw_dom->arch_mbm_states[MBM_STATE_IDX(eventid)];
> +
> + return state ? &state[rmid] : NULL;
> }
>
> void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> @@ -200,14 +196,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_total, 0,
> - sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_local, 0,
> - sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
> + enum resctrl_event_id evt;
> + int idx;
> +
> + for_each_mbm_event(evt) {
> + idx = MBM_STATE_IDX(evt);
> + if (!hw_dom->arch_mbm_states[idx])
> + continue;
Interesting change of pattern to switch from checking "is the event enabled"
to "does the pointer to state exist". This creates doubt about whether an
enabled event may not have its state allocated. The domain would not exist
if the state could not be allocated, no? Why the switch in which check is used?
> + memset(hw_dom->arch_mbm_states[idx], 0,
> + sizeof(struct arch_mbm_state) * r->num_rmid);
> + }
> }
>
> static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 325e23c1a403..4cd0789998bf 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -346,15 +346,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> u32 rmid, enum resctrl_event_id evtid)
> {
> u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> + struct mbm_state *states;
nit: The architectural and non-architectural states are managed the same way. Having the
two helpers look the same helps to enforce and explain this. It is thus unexpected that
one helper refers to the state as "state" (singular) while the other uses "states" (plural).
This causes the reader to squint and try to see what the difference may be, but there is none.
>
> - switch (evtid) {
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &d->mbm_total[idx];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &d->mbm_local[idx];
> - default:
> + if (!resctrl_is_mbm_event(evtid))
> return NULL;
> - }
> +
> + states = d->mbm_states[MBM_STATE_IDX(evtid)];
> +
> + return states ? &states[idx] : NULL;
> }
>
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 06/29] x86,fs/resctrl: Improve domain type checking
2025-05-21 22:50 ` [PATCH v5 06/29] x86,fs/resctrl: Improve domain type checking Tony Luck
@ 2025-06-04 3:31 ` Reinette Chatre
2025-06-04 22:58 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:31 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> The rdt_domain_hdr structure is used in both control and monitor
> domain structures to provide common methods for operations such as
> adding a CPU to a domain, removing a CPU from a domain, accessing
> the mask of all CPUs in a domain.
>
> The "type" field provides a simple check whether a domain is a
> control or monitor domain so that programming errors operating
> on domains will be quickly caught.
>
> To prepare for additional domain types that depend on the rdt_resource
> to which they are connected add the resource id into the header
> and check that in addition to the type.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 9 +++++++++
> arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++----
> fs/resctrl/ctrlmondata.c | 2 +-
> 3 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 40f2d0d48d02..d6b09952ef92 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -131,15 +131,24 @@ enum resctrl_domain_type {
> * @list: all instances of this resource
> * @id: unique id for this instance
> * @type: type of this instance
> + * @rid: index of resource for this domain
> * @cpu_mask: which CPUs share this resource
> */
> struct rdt_domain_hdr {
> struct list_head list;
> int id;
> enum resctrl_domain_type type;
> + enum resctrl_res_level rid;
> struct cpumask cpu_mask;
> };
>
> +static inline bool domain_header_is_valid(struct rdt_domain_hdr *hdr,
> + enum resctrl_domain_type type,
> + enum resctrl_res_level rid)
> +{
> + return !WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
> +}
> +
> /**
> * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
> * @hdr: common header for different domain types
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 4403a820db12..4983f6f81218 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -456,7 +456,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>
> hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
> if (hdr) {
> - if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> + if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> return;
> d = container_of(hdr, struct rdt_ctrl_domain, hdr);
>
This is quite subtle, and it is not obvious until a few patches later that
domain_header_is_valid() is introduced in preparation for using
rdt_domain_hdr::rid to verify that the correct containing structure is
obtained in a subsequent container_of() call.
Patch #10 mentions it explicitly: "Add sanity checks where
container_of() is used to find the surrounding domain structure that
hdr has the expected type."
The change above, when combined with later changes, results in
code like:
	if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
		/* handle failure */
	d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
	...
Considering all of this I do not think using the variable r->rid is appropriate
here. Specifically, if the code hardcodes that, for example,
the containing structure is "struct rdt_l3_mon_domain", then should the
test not similarly be hardcoded to ensure that rid is RDT_RESOURCE_L3?
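In other words, something along the lines of (just a sketch):

	if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
		/* handle failure */
	d = container_of(hdr, struct rdt_l3_mon_domain, hdr);

so the check and the container_of() target can never get out of sync.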
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 07/29] x86,fs/resctrl: Rename some L3 specific functions
2025-05-21 22:50 ` [PATCH v5 07/29] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
@ 2025-06-04 3:32 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:32 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> All monitor events used to be connected to the L3 resource so
> it was OK for function names to be generic. But this will cause
> confusion with additional events tied to other resources.
>
> Rename functions that are only used for L3 features:
This does not rename all functions that are only used for L3 features.
Could you please add the criteria used to decide which ones to rename?
>
> arch_mon_domain_online() -> arch_l3_mon_domain_online()
> mon_domain_free() -> l3_mon_domain_free()
This separates the alloc and free partner functions even more.
The partner, while not completely symmetrical, is arch_domain_mbm_alloc().
How about naming arch_domain_mbm_alloc() -> l3_mon_domain_mbm_alloc()
to at least be closer?
> domain_setup_mon_state() -> domain_setup_l3_mon_state
nit: domain_setup_l3_mon_state -> domain_setup_l3_mon_state()
This breaks symmetry with domain_destroy_mon_state(). Can domain_destroy_mon_state()
be renamed to domain_destroy_l3_mon_state()?
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 09/29] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-05-21 22:50 ` [PATCH v5 09/29] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-06-04 3:32 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:32 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> @@ -4041,6 +4041,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> if (resctrl_mounted && resctrl_arch_mon_capable())
> rmdir_mondata_subdir_allrdtgrp(r, d);
>
> + if (r->rid != RDT_RESOURCE_L3)
> + goto done;
resctrl does not use "do" or "done" terms in goto labels.
"grep goto fs/resctrl/rdtgroup.c" gives a summary of the terms used by this file.
Typical is "out" or "out_<term>" where <term> describes the work jumped to. For example,
using "out_unlock" here will match other code.
> +
> if (resctrl_is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
> @@ -4057,7 +4060,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> }
>
> domain_destroy_mon_state(d);
> -
> +done:
> mutex_unlock(&rdtgroup_mutex);
> }
>
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr
2025-05-21 22:50 ` [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr Tony Luck
2025-05-22 0:01 ` Keshavamurthy, Anil S
@ 2025-06-04 3:37 ` Reinette Chatre
2025-06-07 0:52 ` Fenghua Yu
2 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:37 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Historically all monitoring events have been associated with the L3
> resource and it made sense to use "struct rdt_mon_domain *" arguments
> to functions manipulating domains. But the addition of monitor events
> tied to other resources changes this assumption.
>
> Some functionality like:
> *) adding a CPU to an existing domain
> *) removing a CPU that is not the last one from a domain
> can be achieved with just access to the rdt_domain_hdr structure.
>
> Change arguments from "rdt_*_domain" to rdt_domain_hdr so functions
> can be used on domains from any resource.
>
> Add sanity checks where container_of() is used to find the surrounding
> domain structure that hdr has the expected type.
>
> Simplify code that uses "d->hdr." to "hdr->" where possible.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 4 +-
> arch/x86/kernel/cpu/resctrl/core.c | 39 +++++++-------
> fs/resctrl/rdtgroup.c | 83 +++++++++++++++++++++---------
> 3 files changed, 79 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index d6b09952ef92..c02a4d59f3eb 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -444,9 +444,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type type);
> int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_online_cpu(unsigned int cpu);
> void resctrl_offline_cpu(unsigned int cpu);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index e4125161ffbd..71b884f25475 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -458,9 +458,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (hdr) {
> if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> return;
> - d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> -
> - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + cpumask_set_cpu(cpu, &hdr->cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> return;
> @@ -524,7 +522,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>
> list_add_tail_rcu(&d->hdr.list, add_pos);
>
> - err = resctrl_online_mon_domain(r, d);
> + err = resctrl_online_mon_domain(r, &d->hdr);
> if (err) {
> list_del_rcu(&d->hdr.list);
> synchronize_rcu();
> @@ -597,25 +595,24 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> return;
>
> + cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> + if (!cpumask_empty(&hdr->cpu_mask))
> + return;
> +
> d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> hw_dom = resctrl_to_arch_ctrl_dom(d);
>
> - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> - if (cpumask_empty(&d->hdr.cpu_mask)) {
> - resctrl_offline_ctrl_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> - synchronize_rcu();
> -
> - /*
> - * rdt_ctrl_domain "d" is going to be freed below, so clear
> - * its pointer from pseudo_lock_region struct.
> - */
> - if (d->plr)
> - d->plr->d = NULL;
> - ctrl_domain_free(hw_dom);
> + resctrl_offline_ctrl_domain(r, d);
> + list_del_rcu(&hdr->list);
> + synchronize_rcu();
>
> - return;
> - }
> + /*
> + * rdt_ctrl_domain "d" is going to be freed below, so clear
> + * its pointer from pseudo_lock_region struct.
> + */
> + if (d->plr)
> + d->plr->d = NULL;
> + ctrl_domain_free(hw_dom);
> }
>
How does this hunk relate to the changelog? It seems like unrelated refactoring
to have domain_remove_cpu_ctrl() follow a flow similar to domain_remove_cpu_mon()
after the changes in the previous patch.
> static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> @@ -651,8 +648,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> case RDT_RESOURCE_L3:
> d = container_of(hdr, struct rdt_mon_domain, hdr);
> hw_dom = resctrl_to_arch_mon_dom(d);
> - resctrl_offline_mon_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> + resctrl_offline_mon_domain(r, hdr);
> + list_del_rcu(&hdr->list);
> synchronize_rcu();
> l3_mon_domain_free(hw_dom);
> break;
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 828c743ec470..0213fb3a1113 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3022,7 +3022,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> * when last domain being summed is removed.
> */
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct rdtgroup *prgrp, *crgrp;
> char subname[32];
> @@ -3030,9 +3030,17 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> char name[32];
>
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> - if (snc_mode)
> - sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + if (snc_mode) {
> + struct rdt_mon_domain *d;
> +
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return;
Here is another example where I believe RDT_RESOURCE_L3 is more appropriate
than r->rid because SNC mode only applies to L3.
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> + sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
> + } else {
> + sprintf(name, "mon_%s_%02d", r->name, hdr->id);
> + }
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> @@ -3042,11 +3050,12 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> }
> }
>
> -static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp,
> bool do_sum)
> {
> struct rmid_read rr = {0};
> + struct rdt_mon_domain *d;
This may need an initialization here to eliminate the "may be used uninitialized" warning
during build.
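For example (sketch):

	struct rdt_mon_domain *d = NULL;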
> struct mon_data *priv;
> struct mon_evt *mevt;
> int ret, domid;
> @@ -3054,7 +3063,14 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
> if (mevt->rid != r->rid || !mevt->enabled)
> continue;
> - domid = do_sum ? d->ci->id : d->hdr.id;
> + if (r->rid == RDT_RESOURCE_L3) {
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
Considering the preceding "if()" ... r->rid is essentially RDT_RESOURCE_L3 and
can be made explicit.
> + return -EINVAL;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + domid = do_sum ? d->ci->id : d->hdr.id;
> + } else {
> + domid = hdr->id;
> + }
> priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
Another subtle change is that mon_get_kn_priv() was created for the L3 resource
and is now being subtly switched to be a generic utility.
Consider its last parameter, documented as "Whether SNC summing monitors are
being created.". Surely that can never be set for any resource except L3.
Silently wedging things in like this makes this work difficult to consume.
At least the function's kernel-doc should change, and it could benefit from
a warning if do_sum is ever true for a resource that is not L3.
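For example, something like this at the top of mon_get_kn_priv() (just a sketch;
the parameter name is assumed from the call site quoted above and the return value
follows the NULL check at that call site):

	if (WARN_ON_ONCE(do_sum && rid != RDT_RESOURCE_L3))
		return NULL;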
Similarly, this work also silently takes ownership of struct mon_data.
How does its @sum member apply here? That kernel-doc could also do with an
update stating when @sum can be expected to be valid. Increasingly subtle things
are left to the reader to decipher and it is looking more like this work aims
to wedge itself into resctrl instead of aiming to achieve clean integration.
> if (WARN_ON_ONCE(!priv))
> return -EINVAL;
> @@ -3063,18 +3079,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> if (ret)
> return ret;
>
> - if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
> - mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
> + if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
> + mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt->evtid, true);
> }
>
> return 0;
> }
>
> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> - struct rdt_mon_domain *d,
> + struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> struct kernfs_node *kn, *ckn;
> + struct rdt_mon_domain *d;
> char name[32];
> bool snc_mode;
> int ret = 0;
> @@ -3082,7 +3099,14 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> lockdep_assert_held(&rdtgroup_mutex);
>
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> + if (snc_mode) {
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return -EINVAL;
Same wrt explicit check using L3 resource.
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> + } else {
> + sprintf(name, "mon_%s_%02d", r->name, hdr->id);
> + }
> kn = kernfs_find_and_get(parent_kn, name);
> if (kn) {
> /*
> @@ -3098,13 +3122,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> ret = rdtgroup_kn_set_ugid(kn);
> if (ret)
> goto out_destroy;
> - ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
> + ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
> if (ret)
> goto out_destroy;
> }
>
> if (snc_mode) {
> - sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
> ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> if (IS_ERR(ckn)) {
> ret = -EINVAL;
> @@ -3115,7 +3139,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> if (ret)
> goto out_destroy;
>
> - ret = mon_add_all_files(ckn, d, r, prgrp, false);
> + ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
> if (ret)
> goto out_destroy;
> }
> @@ -3133,7 +3157,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> * and "monitor" groups with given domain id.
> */
> static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct kernfs_node *parent_kn;
> struct rdtgroup *prgrp, *crgrp;
> @@ -3141,12 +3165,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> parent_kn = prgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
>
> head = &prgrp->mon.crdtgrp_list;
> list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
> parent_kn = crgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, crgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
> }
> }
> }
> @@ -3155,14 +3179,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_resource *r,
> struct rdtgroup *prgrp)
> {
> - struct rdt_mon_domain *dom;
> + struct rdt_domain_hdr *hdr;
> int ret;
>
> /* Walking r->domains, ensure it can't race with cpuhp */
> lockdep_assert_cpus_held();
>
> - list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> - ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> + list_for_each_entry(hdr, &r->mon_domains, list) {
> + ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
> if (ret)
> return ret;
> }
> @@ -4030,8 +4054,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> + struct rdt_mon_domain *d;
> +
> mutex_lock(&rdtgroup_mutex);
>
> /*
> @@ -4039,11 +4065,15 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> * per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - rmdir_mondata_subdir_allrdtgrp(r, d);
> + rmdir_mondata_subdir_allrdtgrp(r, hdr);
>
> if (r->rid != RDT_RESOURCE_L3)
> goto done;
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
Again, no need to obfuscate things; considering the earlier if(), this
check can explicitly use RDT_RESOURCE_L3, no?
> + return;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> if (resctrl_is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
> @@ -4126,12 +4156,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
> return err;
> }
>
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> - int err;
> + struct rdt_mon_domain *d;
> + int err = -EINVAL;
>
> mutex_lock(&rdtgroup_mutex);
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + goto out_unlock;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
Similar here ... the container_of() expects the L3 resource,
so the domain_header_is_valid() check should be explicit about it. Making these
flows explicit makes the code much easier to understand.
> err = domain_setup_l3_mon_state(r, d);
> if (err)
> goto out_unlock;
> @@ -4152,7 +4187,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> * If resctrl is mounted, add per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - mkdir_mondata_subdir_allrdtgrp(r, d);
> + mkdir_mondata_subdir_allrdtgrp(r, hdr);
>
> out_unlock:
> mutex_unlock(&rdtgroup_mutex);
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 11/29] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-05-21 22:50 ` [PATCH v5 11/29] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-06-04 3:40 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:40 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Historically all monitoring events have been associated with the L3
> resource. This will change when support for telemetry events is added.
>
> The structures to track monitor domains at both the file system and
> architecture level have generic names. This may cause confusion when
> support for monitoring events in other resources is added.
>
> Rename by adding "l3_" into the names:
> rdt_mon_domain -> rdt_l3_mon_domain
> rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
>
> No functional change.
>
Related to the question in patch #7 about the criteria used to
decide which functions to rename to be L3 specific: this patch highlights
that there are many functions that take an L3 resource specific structure as
a parameter yet the function names are not changed as part of the work in
patch #7.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 13/29] x86,fs/resctrl: Handle events that can be read from any CPU
2025-05-21 22:50 ` [PATCH v5 13/29] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
@ 2025-06-04 3:42 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:42 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Resctrl file system code was built with the assumption that monitor
> events can only be read from a CPU in the cpumask_t set for each
> domain.
>
> This was true for x86 events accessed with an MSR interface, but may
> not be true for other access methods such as MMIO.
>
> Add a flag to struct mon_evt to indicate which events can be read on
> any CPU.
Since struct mon_evt is per-event, how about:
"Add a flag to struct mon_evt to indicate if the event can be read on
any CPU."
>
> Architecture uses resctrl_enable_mon_event() to enable an event and
> set the flag appropriately.
>
> Bypass all the smp_call*() code for events that can be read on any CPU
> and call mon_event_count() directly from mon_event_read().
>
> Skip checks in __mon_event_count() that the read is being done from
> a CPU in the correct domain or cache scope.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 2 +-
> fs/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 6 +++---
> fs/resctrl/ctrlmondata.c | 7 ++++++-
> fs/resctrl/monitor.c | 26 ++++++++++++++++++++------
> 5 files changed, 32 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index b7a4c7bf4feb..9aab3d78005a 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -377,7 +377,7 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
> u32 resctrl_arch_system_num_rmid_idx(void);
> int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
>
> -void resctrl_enable_mon_event(enum resctrl_event_id evtid);
> +void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu);
>
> bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 085a2ee1922f..eb6e92d1ab15 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -57,6 +57,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> * @rid: index of the resource for this event
> * @name: name of the event
> * @configurable: true if the event is configurable
> + * @any_cpu: true if the event can be read from any CPU
> * @enabled: true if the event is enabled
> */
> struct mon_evt {
> @@ -64,6 +65,7 @@ struct mon_evt {
> enum resctrl_res_level rid;
> char *name;
> bool configurable;
> + bool any_cpu;
> bool enabled;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b39537658618..5d9a024ce4b0 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -880,15 +880,15 @@ static __init bool get_rdt_mon_resources(void)
> bool ret = false;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
> - resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
> + resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
> ret = true;
> }
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
> - resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
> + resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
> ret = true;
> }
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
> - resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
> + resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
> ret = true;
> }
>
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index dcde27f6f2ec..1337716f59c8 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -569,6 +569,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> return;
> }
>
> + if (evt->any_cpu) {
> + mon_event_count(rr);
> + goto done;
Please see earlier details about goto in resctrl. This can be "out_ctx_free".
> + }
> +
> cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>
> /*
> @@ -581,7 +586,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> smp_call_function_any(cpumask, mon_event_count, rr, 1);
> else
> smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
> -
> +done:
> resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
> }
>
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 3cfd1bf1845e..e6e3be990638 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -356,9 +356,24 @@ static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
> return states ? &states[idx] : NULL;
> }
>
> +static bool cpu_on_wrong_domain(struct rmid_read *rr)
> +{
> + cpumask_t *mask;
> +
> + if (rr->evt->any_cpu)
> + return false;
> +
> + /*
> + * When reading from a specific domain the CPU must be in that
> + * domain. Otherwise the CPU must be one that shares the cache.
> + */
> + mask = rr->d ? &rr->d->hdr.cpu_mask : &rr->ci->shared_cpu_map;
> +
> + return !cpumask_test_cpu(smp_processor_id(), mask);
> +}
I find double-negatives can trip people up. Having the function name contain
"wrong" while the function also returns a negated ("!") test can be confusing.
I think this will be simpler if it is a straightforward utility, for example,
"cpu_on_correct_domain()"? Maybe even "current_cpu_on_correct_domain()" to be
explicit about which CPU is being checked.
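For example, a rough sketch of the inverted helper (based on the code above):

	static bool cpu_on_correct_domain(struct rmid_read *rr)
	{
		cpumask_t *mask;

		if (rr->evt->any_cpu)
			return true;

		/*
		 * When reading from a specific domain the CPU must be in that
		 * domain. Otherwise the CPU must be one that shares the cache.
		 */
		mask = rr->d ? &rr->d->hdr.cpu_mask : &rr->ci->shared_cpu_map;

		return cpumask_test_cpu(smp_processor_id(), mask);
	}

with the callers then doing "if (!cpu_on_correct_domain(rr))".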
> +
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> {
> - int cpu = smp_processor_id();
> struct rdt_l3_mon_domain *d;
> struct mbm_state *m;
> int err, ret;
> @@ -373,8 +388,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> }
>
> if (rr->d) {
> - /* Reading a single domain, must be on a CPU in that domain. */
> - if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
> + if (cpu_on_wrong_domain(rr))
> return -EINVAL;
> rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
> rr->evt->evtid, &tval, rr->arch_mon_ctx);
> @@ -386,8 +400,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> return 0;
> }
>
> - /* Summing domains that share a cache, must be on a CPU for that cache. */
> - if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
> + if (cpu_on_wrong_domain(rr))
> return -EINVAL;
>
> /*
> @@ -865,7 +878,7 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
> },
> };
>
> -void resctrl_enable_mon_event(enum resctrl_event_id evtid)
> +void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
> {
> if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
> return;
> @@ -874,6 +887,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid)
> return;
> }
>
> + mon_event_all[evtid].any_cpu = any_cpu;
> mon_event_all[evtid].enabled = true;
> }
>
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters
2025-05-21 22:50 ` [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
@ 2025-06-04 3:49 ` Reinette Chatre
2025-06-06 16:25 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:49 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Resctrl was written with the assumption that all monitor events
> can be displayed as unsigned decimal integers.
>
> Hardware architecture counters may provide some telemetry events with
> greater precision where the event is not a simple count, but is a
> measurement of some sort (e.g. Joules for energy consumed).
>
> Add a new argument to resctrl_enable_mon_event() for architecture
> code to inform the file system that the value for a counter is
> a fixed-point value with a specific number of binary places.
The resctrl fs contract with user space, per patch #29, is that only "core_energy"
and "activity" can be floating point. We do not want to make it possible for
an architecture to change this contract. Other events should not be able
to become floating point. I thus think there needs to be an extra setting that
indicates _if_ the architecture can specify a fraction.
>
> Fixed point values are displayed with values rounded to an
> appropriate number of decimal places.
How are the "appropriate number of decimal places" determined?
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 4 +-
> fs/resctrl/internal.h | 2 +
> arch/x86/kernel/cpu/resctrl/core.c | 6 +--
> fs/resctrl/ctrlmondata.c | 75 +++++++++++++++++++++++++++++-
> fs/resctrl/monitor.c | 5 +-
> 5 files changed, 85 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 9aab3d78005a..46ba62ee94a1 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -377,7 +377,9 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
> u32 resctrl_arch_system_num_rmid_idx(void);
> int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
>
> -void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu);
> +#define MAX_BINARY_BITS 27
> +
> +void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu, u32 binary_bits);
>
> bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index eb6e92d1ab15..d5045491790e 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -58,6 +58,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> * @name: name of the event
> * @configurable: true if the event is configurable
> * @any_cpu: true if the event can be read from any CPU
> + * @binary_bits: number of fixed-point binary bits from architecture
> * @enabled: true if the event is enabled
> */
> struct mon_evt {
> @@ -66,6 +67,7 @@ struct mon_evt {
> char *name;
> bool configurable;
> bool any_cpu;
> + int binary_bits;
> bool enabled;
> };
Perhaps a new member "is_floating_point" can be hardcoded by resctrl fs and only
events that are floating point can have their precision set?
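For example (a sketch, member names only for illustration):

	struct mon_evt {
		...
		bool	is_floating_point;	/* hardcoded by resctrl fs */
		int	binary_bits;		/* only honored when @is_floating_point */
		...
	};

with resctrl_enable_mon_event() warning about and ignoring a non-zero binary_bits
for any event that resctrl fs has not marked as floating point.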
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 5d9a024ce4b0..306afb50fd37 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -880,15 +880,15 @@ static __init bool get_rdt_mon_resources(void)
> bool ret = false;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
> - resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
> + resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
We do not want the architecture to be able to make these floating point.
> ret = true;
> }
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
> - resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
> + resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
> ret = true;
> }
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
> - resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
> + resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
> ret = true;
> }
>
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index 1337716f59c8..07bf44834a46 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -590,6 +590,77 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
> }
>
> +/**
> + * struct fixed_params - parameters to decode a binary fixed point value
> + * @mask: Mask for fractional part of value.
> + * @lshift: Shift to round-up binary places.
> + * @pow10: Multiplier (10 ^ decimal places).
> + * @round: Add to round up to nearest decimal representation.
> + * @rshift: Shift back for final answer.
> + * @decplaces: Number of decimal places for this number of binary places.
> + */
> +struct fixed_params {
> + u64 mask;
> + int lshift;
> + int pow10;
> + u64 round;
> + int rshift;
> + int decplaces;
> +};
> +
> +static struct fixed_params fixed_params[MAX_BINARY_BITS + 1] = {
> + [1] = { GENMASK_ULL(1, 0), 0, 10, 0x00000000, 1, 1 },
> + [2] = { GENMASK_ULL(2, 0), 0, 100, 0x00000000, 2, 2 },
> + [3] = { GENMASK_ULL(3, 0), 0, 1000, 0x00000000, 3, 3 },
> + [4] = { GENMASK_ULL(4, 0), 2, 1000, 0x00000020, 6, 3 },
> + [5] = { GENMASK_ULL(5, 0), 1, 1000, 0x00000020, 6, 3 },
> + [6] = { GENMASK_ULL(6, 0), 0, 1000, 0x00000020, 6, 3 },
> + [7] = { GENMASK_ULL(7, 0), 2, 1000, 0x00000100, 9, 3 },
> + [8] = { GENMASK_ULL(8, 0), 1, 1000, 0x00000100, 9, 3 },
> + [9] = { GENMASK_ULL(9, 0), 0, 1000, 0x00000100, 9, 3 },
> + [10] = { GENMASK_ULL(10, 0), 2, 10000, 0x00000800, 12, 4 },
> + [11] = { GENMASK_ULL(11, 0), 1, 10000, 0x00000800, 12, 4 },
> + [12] = { GENMASK_ULL(12, 0), 0, 10000, 0x00000800, 12, 4 },
> + [13] = { GENMASK_ULL(13, 0), 2, 100000, 0x00004000, 15, 5 },
> + [14] = { GENMASK_ULL(14, 0), 1, 100000, 0x00004000, 15, 5 },
> + [15] = { GENMASK_ULL(15, 0), 0, 100000, 0x00004000, 15, 5 },
> + [16] = { GENMASK_ULL(16, 0), 2, 1000000, 0x00020000, 18, 6 },
> + [17] = { GENMASK_ULL(17, 0), 1, 1000000, 0x00020000, 18, 6 },
> + [18] = { GENMASK_ULL(18, 0), 0, 1000000, 0x00020000, 18, 6 },
> + [19] = { GENMASK_ULL(19, 0), 2, 10000000, 0x00100000, 21, 7 },
> + [20] = { GENMASK_ULL(20, 0), 1, 10000000, 0x00100000, 21, 7 },
> + [21] = { GENMASK_ULL(21, 0), 0, 10000000, 0x00100000, 21, 7 },
> + [22] = { GENMASK_ULL(22, 0), 2, 100000000, 0x00800000, 24, 8 },
> + [23] = { GENMASK_ULL(23, 0), 1, 100000000, 0x00800000, 24, 8 },
> + [24] = { GENMASK_ULL(24, 0), 0, 100000000, 0x00800000, 24, 8 },
> + [25] = { GENMASK_ULL(25, 0), 2, 1000000000, 0x04000000, 27, 9 },
> + [26] = { GENMASK_ULL(26, 0), 1, 1000000000, 0x04000000, 27, 9 },
> + [27] = { GENMASK_ULL(27, 0), 0, 1000000000, 0x04000000, 27, 9 }
> +};
> +
> +static void print_event_value(struct seq_file *m, int binary_bits, u64 val)
> +{
> + struct fixed_params *fp = &fixed_params[binary_bits];
> + unsigned long long frac;
> + char buf[10];
> +
> + frac = val & fp->mask;
> + frac <<= fp->lshift;
> + frac *= fp->pow10;
> + frac += fp->round;
> + frac >>= fp->rshift;
> +
Could you please document this algorithm? I wonder why lshift is necessary at all
and why rshift cannot just always be the fraction bits? Also note earlier question about
choice of decimal places.
> + sprintf(buf, "%0*llu", fp->decplaces, frac);
I'm a bit confused here. I see fp->decplaces as the field width and the "0" indicates
that the value is zero padded on the _left_. I interpret this to mean that, for example,
if the value of frac is 42 then it will be printed as "0042". The fraction's value is modified
(it is printed as "0.0042") and there are no trailing zeroes to remove. What am I missing?
> +
> + /* Trim trailing zeroes */
> + for (int i = fp->decplaces - 1; i > 0; i--) {
> + if (buf[i] != '0')
> + break;
> + buf[i] = '\0';
> + }
> + seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
> +}
> +
> int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> {
> struct kernfs_open_file *of = m->private;
> @@ -657,8 +728,10 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> seq_puts(m, "Error\n");
> else if (rr.err == -EINVAL)
> seq_puts(m, "Unavailable\n");
> - else
> + else if (evt->binary_bits == 0)
> seq_printf(m, "%llu\n", rr.val);
> + else
> + print_event_value(m, evt->binary_bits, rr.val);
>
> out:
> rdtgroup_kn_unlock(of->kn);
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index e6e3be990638..f554d7933739 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -878,9 +878,9 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
> },
> };
>
> -void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
> +void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu, u32 binary_bits)
> {
> - if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS))
> + if (WARN_ON_ONCE(evtid >= QOS_NUM_EVENTS) || binary_bits > MAX_BINARY_BITS)
> return;
> if (mon_event_all[evtid].enabled) {
> pr_warn("Duplicate enable for event %d\n", evtid);
> @@ -888,6 +888,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu)
> }
>
> mon_event_all[evtid].any_cpu = any_cpu;
> + mon_event_all[evtid].binary_bits = binary_bits;
> mon_event_all[evtid].enabled = true;
> }
>
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 15/29] fs/resctrl: Add an architectural hook called for each mount
2025-05-21 22:50 ` [PATCH v5 15/29] fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-06-04 3:49 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:49 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 306afb50fd37..f8c9840ce7dc 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -710,6 +710,15 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
> return 0;
> }
>
> +void resctrl_arch_pre_mount(void)
> +{
> + static atomic_t only_once;
Looks like the custom is to initialize with ATOMIC_INIT(0).
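That is:

	static atomic_t only_once = ATOMIC_INIT(0);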
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 17/29] x86/resctrl: Discover hardware telemetry events
2025-05-21 22:50 ` [PATCH v5 17/29] x86/resctrl: Discover hardware telemetry events Tony Luck
@ 2025-06-04 3:53 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:53 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Hardware has one or more telemetry event aggregators per package
> for each group of telemetry events. Each aggregator provides access
> to event counts in an array of 64-bit values in MMIO space. There
> is a "guid" (in this case a unique 32-bit integer) which refers to
> an XML file published in the https://github.com/intel/Intel-PMT
> that provides all the details about each aggregator.
>
> The XML files provide the following information:
> 1) Which telemetry events are included in the group for this aggregator.
> 2) The order in which the event counters appear for each RMID.
> 3) The value type of each event counter (integer or fixed-point).
> 4) The number of RMIDs supported.
> 5) Which additional aggregator status registers are included.
> 6) The total size of the MMIO region for this aggregator.
>
> There is an INTEL_PMT_DISCOVERY driver that enumerate all aggregators
> on the system with intel_pmt_get_regions_by_feature(). Call this for
> each pmt_feature_id that indicates per-RMID telemetry.
>
> Save the returned pmt_feature_group pointers with guids that are known
> to resctrl for use at run time.
>
> Those pointers are returned to the INTEL_PMT_DISCOVERY driver at
> resctrl_arch_exit() time.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +
> arch/x86/kernel/cpu/resctrl/core.c | 5 +
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 129 ++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/Makefile | 1 +
> 4 files changed, 138 insertions(+)
> create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 02c9e7d163dc..2b2d4b5a4643 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -167,4 +167,7 @@ void __init intel_rdt_mbm_apply_quirk(void);
>
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>
> +bool intel_aet_get_events(void);
> +void __exit intel_aet_exit(void);
> +
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index ce4885c751e4..64ce561e77a0 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -727,6 +727,9 @@ void resctrl_arch_pre_mount(void)
>
> if (!atomic_try_cmpxchg(&only_once, &old, 1))
> return;
> +
> + if (!intel_aet_get_events())
> + return;
> }
>
> enum {
> @@ -1079,6 +1082,8 @@ late_initcall(resctrl_arch_late_init);
>
> static void __exit resctrl_arch_exit(void)
> {
> + intel_aet_exit();
> +
> cpuhp_remove_state(rdt_online);
>
> resctrl_exit();
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> new file mode 100644
> index 000000000000..df73b9476c4d
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Resource Director Technology(RDT)
> + * - Intel Application Energy Telemetry
> + *
> + * Copyright (C) 2025 Intel Corporation
> + *
> + * Author:
> + * Tony Luck <tony.luck@intel.com>
> + */
> +
> +#define pr_fmt(fmt) "resctrl: " fmt
> +
> +#include <linux/cleanup.h>
> +#include <linux/cpu.h>
> +#include <linux/resctrl.h>
> +
> +/* Temporary - delete from final version */
> +#include "fake_intel_aet_features.h"
> +
> +#include "internal.h"
> +
> +/**
> + * struct event_group - All information about a group of telemetry events.
> + * @pfg: Points to the aggregated telemetry space information
> + * within the OOBMSM driver that contains data for all
> + * telemetry regions.
> + * @guid: Unique number per XML description file.
> + */
> +struct event_group {
> + /* Data fields used by this code. */
As opposed to data fields _not_ used by this code?
> + struct pmt_feature_group *pfg;
> +
> + /* Remaining fields initialized from XML file. */
> + u32 guid;
> +};
> +
> +/*
> + * Link: https://github.com/intel/Intel-PMT
> + * File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
> + */
> +static struct event_group energy_0x26696143 = {
> + .guid = 0x26696143,
> +};
> +
> +/*
> + * Link: https://github.com/intel/Intel-PMT
> + * File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
> + */
> +static struct event_group perf_0x26557651 = {
> + .guid = 0x26557651,
> +};
> +
> +static struct event_group *known_event_groups[] = {
> + &energy_0x26696143,
> + &perf_0x26557651,
> +};
> +
> +#define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
> +
> +/* Stub for now */
> +static int configure_events(struct event_group *e, struct pmt_feature_group *p)
> +{
> + return -EINVAL;
> +}
> +
> +DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
> + if (!IS_ERR_OR_NULL(_T)) \
> + intel_pmt_put_feature_group(_T))
> +
Line continuations are not necessary (checkpatch.pl)
> +/*
> + * Make a request to the INTEL_PMT_DISCOVERY driver for the
> + * pmt_feature_group for a specific feature. If there is
> + * one the returned structure has an array of telemetry_region
> + * structures. Each describes one telemetry aggregator.
> + * Try to configure any with a known matching guid.
> + */
> +static bool get_pmt_feature(enum pmt_feature_id feature)
> +{
> + struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
> + struct event_group **peg;
> + bool ret;
> +
> + p = intel_pmt_get_regions_by_feature(feature);
> +
> + if (IS_ERR_OR_NULL(p))
> + return false;
> +
What is the chance of p being valid here but there are no valid aggregators
for this system? (more below)
> + for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
> + for (int i = 0; i < p->count; i++) {
> + if ((*peg)->guid == p->regions[i].guid) {
At first this loop looks wrong since it seems to skip some aggregators by only running
configure_events() on the first aggregator found. After digging to understand what this does,
it looks like unnecessary duplication to loop through the aggregators here to determine if
configure_events() should be called and then to loop through the aggregators again
within configure_events(). This is not obvious in this patch, but comparing it
to what is coming in patch #18 I wonder if this cannot just be:
	for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
		ret = configure_events(*peg, p);
		if (!ret) {
			(*peg)->pfg = no_free_ptr(p);
			return true;
		}
	}
In turn, configure_events() can contain:
	for (int i = 0; i < p->count; i++) {
		tr = &p->regions[i];
		if (skip_this_region(tr, e))
			continue;
		if (!pkgcounts) {
			pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
			if (!pkgcounts)
				return -ENOMEM;
		}
		pkgcounts[tr->plat_info.package_id]++;
	}
	if (!pkgcounts)
		return -ENODEV; /* TBD error code */
> + ret = configure_events(*peg, p);
> + if (!ret) {
> + (*peg)->pfg = no_free_ptr(p);
> + return true;
> + }
> + break;
> + }
> + }
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Ask OOBMSM discovery driver for all the RMID based telemetry groups
> + * that it supports.
> + */
> +bool intel_aet_get_events(void)
> +{
> + bool ret1, ret2;
> +
> + ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM);
> + ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM);
> +
> + return ret1 || ret2;
> +}
> +
> +void __exit intel_aet_exit(void)
> +{
> + struct event_group **peg;
> +
> + for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
> + if ((*peg)->pfg) {
> + intel_pmt_put_feature_group((*peg)->pfg);
> + (*peg)->pfg = NULL;
> + }
> + }
> +}
> diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
> index cf4fac58d068..cca23f06d15d 100644
> --- a/arch/x86/kernel/cpu/resctrl/Makefile
> +++ b/arch/x86/kernel/cpu/resctrl/Makefile
> @@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
> obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
> +obj-$(CONFIG_X86_CPU_RESCTRL) += intel_aet.o
> obj-$(CONFIG_X86_CPU_RESCTRL) += fake_intel_aet_features.o
> obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
>
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 18/29] x86/resctrl: Count valid telemetry aggregators per package
2025-05-21 22:50 ` [PATCH v5 18/29] x86/resctrl: Count valid telemetry aggregators per package Tony Luck
@ 2025-06-04 3:54 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:54 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> There may be multiple telemetry aggregators per package, each enumerated
> by a telemetry region structure in the feature group.
>
> Scan the array of telemetry region structures and count how many are
> in each package in preparation to allocate structures to save the MMIO
> addresses for each in a convenient format for use when reading event
> counters.
>
> Sanity check that the telemetry region structures have a valid
> package_id and that the size they report for the MMIO space is as
> large as expected from the XML description of the registers in
> the region.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 45 ++++++++++++++++++++++++-
> 1 file changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index df73b9476c4d..ffcb54be54ea 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -14,6 +14,7 @@
> #include <linux/cleanup.h>
> #include <linux/cpu.h>
> #include <linux/resctrl.h>
> +#include <linux/slab.h>
>
> /* Temporary - delete from final version */
> #include "fake_intel_aet_features.h"
> @@ -26,6 +27,7 @@
> * within the OOBMSM driver that contains data for all
> * telemetry regions.
> * @guid: Unique number per XML description file.
> + * @mmio_size: Number of bytes of MMIO registers for this group.
> */
> struct event_group {
> /* Data fields used by this code. */
> @@ -33,6 +35,7 @@ struct event_group {
>
> /* Remaining fields initialized from XML file. */
> u32 guid;
> + size_t mmio_size;
> };
>
> /*
> @@ -41,6 +44,7 @@ struct event_group {
> */
> static struct event_group energy_0x26696143 = {
> .guid = 0x26696143,
> + .mmio_size = (576 * 2 + 3) * 8,
Could you please add a snippet to the struct description that
documents what these constants mean?
> };
>
> /*
> @@ -49,6 +53,7 @@ static struct event_group energy_0x26696143 = {
> */
> static struct event_group perf_0x26557651 = {
> .guid = 0x26557651,
> + .mmio_size = (576 * 7 + 3) * 8,
Same here.
> };
>
> static struct event_group *known_event_groups[] = {
> @@ -58,9 +63,47 @@ static struct event_group *known_event_groups[] = {
>
> #define NUM_KNOWN_GROUPS ARRAY_SIZE(known_event_groups)
>
> -/* Stub for now */
> +static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
> +{
> + if (tr->guid != e->guid)
> + return true;
> + if (tr->plat_info.package_id >= topology_max_packages()) {
> + pr_warn_once("Bad package %d in guid 0x%x\n", tr->plat_info.package_id,
> + tr->guid);
> + return true;
> + }
> + if (tr->size < e->mmio_size) {
> + pr_warn_once("MMIO space too small for guid 0x%x\n", e->guid);
With e->mmio_size hardcoded it may be useful to print the size claimed
by the telemetry region.
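For example, something like (exact format specifiers depend on the field types):

	pr_warn_once("MMIO space too small (%zu < %zu) for guid 0x%x\n",
		     tr->size, e->mmio_size, e->guid);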
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Configure events from one pmt_feature_group.
> + * 1) Count how many per package.
> + * 2...) To be continued.
> + */
> static int configure_events(struct event_group *e, struct pmt_feature_group *p)
> {
> + int *pkgcounts __free(kfree) = NULL;
> + struct telemetry_region *tr;
> + int num_pkgs;
> +
> + num_pkgs = topology_max_packages();
> + pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
> + if (!pkgcounts)
> + return -ENOMEM;
> +
> + /* Get per-package counts of telemetry_regions for this event group */
> + for (int i = 0; i < p->count; i++) {
> + tr = &p->regions[i];
> + if (skip_this_region(tr, e))
> + continue;
> + pkgcounts[tr->plat_info.package_id]++;
> + }
> +
> return -EINVAL;
> }
>
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events
2025-05-21 22:50 ` [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
@ 2025-06-04 3:57 ` Reinette Chatre
2025-06-07 0:57 ` Fenghua Yu
1 sibling, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 3:57 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> +/**
> + * struct pmt_event - Telemetry event.
> + * @evtid: Resctrl event id
> + * @evt_idx: Counter index within each per-RMID block of counters
> + * @bin_bits: Zero for integer valued events, else number bits in fixed-point
> + */
> +struct pmt_event {
> + enum resctrl_event_id evtid;
> + int evt_idx;
> + int bin_bits;
> +};
It seems redundant to have "evt" in the member names when the variable will
already have "evts" during access. This results in code like
evts[i].evtid and evts[i].evt_idx when just
evts[i].id and evts[i].idx would be just as descriptive.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 21/29] x86/resctrl: x86/resctrl: Read core telemetry events
2025-05-21 22:50 ` [PATCH v5 21/29] x86/resctrl: x86/resctrl: Read core telemetry events Tony Luck
@ 2025-06-04 4:02 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 4:02 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
shortlog has a duplicate "x86/resctrl"
On 5/21/25 3:50 PM, Tony Luck wrote:
> The resctrl file system passes requests to read event monitor files to
> the architecture resctrl_arch_rmid_read() to collect values
> from hardware counters.
>
> Use the resctrl resource to differentiate between calls to read legacy
> L3 events from the new telemetry events (which are attached to
> RDT_RESOURCE_PERF_PKG).
>
> There may be multiple aggregators tracking each package, so scan all of
> them and add up all counters.
>
> At run time when a user reads an event file the file system code
> provides the enum resctrl_event_id for the event.
>
> Create a lookup table indexed by event id to provide the telem_entry
> structure and the event index into MMIO space.
First time asking whether the lookup table is needed:
V3: https://lore.kernel.org/lkml/7bb97892-16fd-49c5-90f0-223526ebdf4c@intel.com/
Reminder in V4 that the question about the need for the lookup table is still unanswered:
https://lore.kernel.org/lkml/54291845-a964-4d6a-948c-6d6b14a705dd@intel.com/
Here goes my third attempt:
I still feel that a new lookup table is unnecessary. Looking at the new
structure introduced, it unnecessarily duplicates the idx value from
struct pmt_event. As I proposed before, what if struct mon_evt gets
a void *priv that the architecture can set during event enable? resctrl
fs can then provide this pointer back to arch code when the user attempts
to read the event.
In this implementation it looks like this void *priv of an event could
point to the event's struct pmt_event entry. The only thing that is missing
is the struct event_group pointer. Looking at the previous patch, every struct
pmt_event has a sequential index that makes it possible to determine &evts[0]
from any of the struct pmt_event pointers, enabling the use of
container_of() to determine the struct event_group pointer. What do you think?
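Roughly (a sketch, names only for illustration, assuming evts[] is embedded
in struct event_group):

	/* fs side */
	struct mon_evt {
		...
		void	*arch_priv;	/* opaque, set by the architecture */
	};

	void resctrl_enable_mon_event(enum resctrl_event_id evtid, bool any_cpu,
				      u32 binary_bits, void *arch_priv);

	/* arch side, at enable time */
	resctrl_enable_mon_event(evt, true, e->evts[i].bin_bits, &e->evts[i]);

	/* arch side, at read time (priv handed back by resctrl fs) */
	struct pmt_event *pe = priv;
	struct event_group *e = container_of(pe - pe->evt_idx, struct event_group, evts[0]);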
I surely may be missing something; when I do, please use it as a teaching moment
instead of ignoring me. I spend a lot of time studying your work with the
goal of providing useful feedback. Having that feedback simply ignored makes me
feel like I am wasting my time.
> Enable the events marked as readable from any CPU.
>
> Resctrl now uses readq() so depends on X86_64. Update Kconfig.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 53 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 6 +++
> arch/x86/Kconfig | 2 +-
> 4 files changed, 61 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 2b2d4b5a4643..42da0a222c7c 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -169,5 +169,6 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>
> bool intel_aet_get_events(void);
> void __exit intel_aet_exit(void);
> +int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val);
>
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index bf8e2a6126d2..be52c9302a80 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -13,6 +13,7 @@
>
> #include <linux/cleanup.h>
> #include <linux/cpu.h>
> +#include <linux/io.h>
> #include <linux/resctrl.h>
> #include <linux/slab.h>
>
> @@ -128,6 +129,16 @@ static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
> return false;
> }
>
> +/**
> + * struct evtinfo - lookup table from resctrl_event_id to useful information
> + * @event_group: Pointer to the telem_entry structure for this event
My V4 question about the "telem_entry structure" is still unanswered (note that the changelog
also refers to the "telem_entry structure"):
https://lore.kernel.org/lkml/54291845-a964-4d6a-948c-6d6b14a705dd@intel.com/
After I apply this series I see:
$ git grep telem_entry
arch/x86/kernel/cpu/resctrl/intel_aet.c: * @event_group: Pointer to the telem_entry structure for this event
This is thus the only reference to "telem_entry" and I still
have the same question.
> + * @idx: Counter index within each per-RMID block of counters
> + */
> +static struct evtinfo {
> + struct event_group *event_group;
> + int idx;
> +} evtinfo[QOS_NUM_EVENTS];
> +
> static void free_mmio_info(struct mmio_info **mmi)
> {
> int num_pkgs = topology_max_packages();
> @@ -199,6 +210,15 @@ static int configure_events(struct event_group *e, struct pmt_feature_group *p)
> }
> e->pkginfo = no_free_ptr(pkginfo);
>
> + for (int i = 0; i < e->num_events; i++) {
> + enum resctrl_event_id evt;
> +
> + evt = e->evts[i].evtid;
> + evtinfo[evt].event_group = e;
> + evtinfo[evt].idx = e->evts[i].evt_idx;
> + resctrl_enable_mon_event(evt, true, e->evts[i].bin_bits);
> + }
> +
> return 0;
> }
>
> @@ -266,3 +286,36 @@ void __exit intel_aet_exit(void)
> free_mmio_info((*peg)->pkginfo);
> }
> }
> +
> +#define DATA_VALID BIT_ULL(63)
> +#define DATA_BITS GENMASK_ULL(62, 0)
> +
> +/*
> + * Read counter for an event on a domain (summing all aggregators
> + * on the domain).
> + */
> +int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val)
> +{
> + struct evtinfo *info = &evtinfo[evtid];
> + struct mmio_info *mmi;
> + u64 evtcount;
> + int idx;
> +
> + idx = rmid * info->event_group->num_events;
> + idx += info->idx;
> + mmi = info->event_group->pkginfo[domid];
> +
> + if (idx * sizeof(u64) + sizeof(u64) > info->event_group->mmio_size) {
> + pr_warn_once("MMIO index %d out of range\n", idx);
> + return -EIO;
> + }
> +
> + for (int i = 0; i < mmi->count; i++) {
> + evtcount = readq(mmi->addrs[i] + idx * sizeof(u64));
> + if (!(evtcount & DATA_VALID))
> + return -EINVAL;
> + *val += evtcount & DATA_BITS;
> + }
> +
> + return 0;
> +}
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 1f6dc253112f..c99aa9dacfd8 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -230,6 +230,12 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
>
> resctrl_arch_rmid_read_context_check();
>
> + if (r->rid == RDT_RESOURCE_PERF_PKG)
> + return intel_aet_read_event(d->hdr.id, rmid, eventid, val);
> +
This does not look right. As per the hunk header, the function being changed has the following signature:
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
			   u64 *val, void *ignored)
The domain provided to the function is a pointer to a struct rdt_l3_mon_domain
so seeing this "r->rid == RDT_RESOURCE_PERF_PKG" test is unexpected because a
domain with type struct rdt_l3_mon_domain should not belong to a PERF_PKG resource.
Looks like that whole stack starting from rdtgroup_mondata_show() needs a second
look. Review of this work has not been going well and the skeptic in me is now
starting to think that the answer to my earlier question about why only a
subset of L3 resource specific functions are renamed is: "because if all L3
specific functions are renamed then it will be easier for reviewer to notice
when L3 specific functions are (ab)used for the PERF_PKG resource."
This extends to the data structures: the new events rely
on rdtgroup_mondata_show() to query the data, and in turn rdtgroup_mondata_show()
relies on struct rmid_read, which only has one domain pointer, and it points to a
struct rdt_l3_mon_domain.
> + if (r->rid != RDT_RESOURCE_L3)
> + return -EIO;
> +
> prmid = logical_rmid_to_physical_rmid(cpu, rmid);
> ret = __rmid_read_phys(prmid, eventid, &msr_val);
> if (ret)
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 52cfb69c343f..24df3f04a115 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -506,7 +506,7 @@ config X86_MPPARSE
>
> config X86_CPU_RESCTRL
> bool "x86 CPU resource control support"
> - depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> + depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> depends on MISC_FILESYSTEMS
> select ARCH_HAS_CPU_RESCTRL
> select RESCTRL_FS
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 19/29] x86/resctrl: Complete telemetry event enumeration
2025-05-21 22:50 ` [PATCH v5 19/29] x86/resctrl: Complete telemetry event enumeration Tony Luck
@ 2025-06-04 4:05 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 4:05 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Counters for telemetry events are in MMIO space. Each telemetry_region
> structure returned in the pmt_feature_group returned from OOBMSM
> contains the base MMIO address for the counters.
>
> Scan all the telemetry_region structures again and gather these
> addresses into a more convenient structure with addresses for
> each aggregator indexed by package id. Note that there may be
> multiple aggregators per package.
>
> Completed structure for each event group looks like this:
The depiction is very useful, but please note that the above description
only equips the reader with "a more convenient structure" with which to interpret
what the diagram below represents. It is a leap.
>
> +---------------------+---------------------+
> pkginfo** -->| pkginfo[0] | pkginfo[1] |
> +---------------------+---------------------+
> | |
> v v
> +----------------+ +----------------+
> |struct mmio_info| |struct mmio_info|
> +----------------+ +----------------+
> | count = N | | count = N |
I think renaming "count" to "num_regions" will help a lot to
explain this data.
> | addrs[0] | | addrs[0] |
> | addrs[0] | | addrs[0] |
[0] -> [1]
> | ... | | ... |
> | addrs[N-1] | | addrs[N-1] |
> +----------------+ +----------------+
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 65 ++++++++++++++++++++++++-
> 1 file changed, 64 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index ffcb54be54ea..2316198eb69e 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -21,17 +21,32 @@
>
> #include "internal.h"
>
> +/**
> + * struct mmio_info - Array of MMIO addresses for one event group for a package
"MMIO address information for one event group of a package."?
> + * @count: Number of addresses on this package
"@num_regions: Number of telemetry regions on this package."?
> + * @addrs: The MMIO addresses
"Array of MMIO addresses, one per telemetry region on this package."?
> + *
> + * Provides convenient access to all MMIO addresses of one event group
> + * for one package. Used when reading event data on a package.
> + */
> +struct mmio_info {
> + int count;
> + void __iomem *addrs[] __counted_by(count);
> +};
> +
> /**
> * struct event_group - All information about a group of telemetry events.
> * @pfg: Points to the aggregated telemetry space information
> * within the OOBMSM driver that contains data for all
> * telemetry regions.
> + * @pkginfo: Per-package MMIO addresses of telemetry regions belonging to this group
Sentence should end with a period.
> * @guid: Unique number per XML description file.
> * @mmio_size: Number of bytes of MMIO registers for this group.
> */
> struct event_group {
> /* Data fields used by this code. */
> struct pmt_feature_group *pfg;
> + struct mmio_info **pkginfo;
>
> /* Remaining fields initialized from XML file. */
> u32 guid;
> @@ -80,6 +95,20 @@ static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
> return false;
> }
>
> +static void free_mmio_info(struct mmio_info **mmi)
> +{
> + int num_pkgs = topology_max_packages();
> +
> + if (!mmi)
> + return;
> +
> + for (int i = 0; i < num_pkgs; i++)
> + kfree(mmi[i]);
> + kfree(mmi);
> +}
> +
> +DEFINE_FREE(mmio_info, struct mmio_info **, free_mmio_info(_T))
> +
> /*
> * Configure events from one pmt_feature_group.
> * 1) Count how many per package.
> @@ -87,10 +116,17 @@ static bool skip_this_region(struct telemetry_region *tr, struct event_group *e)
> */
> static int configure_events(struct event_group *e, struct pmt_feature_group *p)
> {
> + struct mmio_info __free(mmio_info) **pkginfo = NULL;
struct mmio_info **pkginfo __free(mmio_info) = NULL;
(checkpatch.pl)
> int *pkgcounts __free(kfree) = NULL;
> struct telemetry_region *tr;
> + struct mmio_info *mmi;
> int num_pkgs;
>
> + if (e->pkginfo) {
> + pr_warn("Duplicate telemetry information for guid 0x%x\n", e->guid);
This could be triggered by user space so it may be safer with a "once".
> + return -EINVAL;
> + }
> +
> num_pkgs = topology_max_packages();
> pkgcounts = kcalloc(num_pkgs, sizeof(*pkgcounts), GFP_KERNEL);
> if (!pkgcounts)
> @@ -104,7 +140,33 @@ static int configure_events(struct event_group *e, struct pmt_feature_group *p)
> pkgcounts[tr->plat_info.package_id]++;
> }
>
> - return -EINVAL;
> + /* Allocate array for per-package struct mmio_info data */
> + pkginfo = kcalloc(num_pkgs, sizeof(*pkginfo), GFP_KERNEL);
> + if (!pkginfo)
> + return -ENOMEM;
> +
> + /*
> + * Allocate per-package mmio_info structures and initialize
> + * count of telemetry_regions in each one.
> + */
> + for (int i = 0; i < num_pkgs; i++) {
> + pkginfo[i] = kzalloc(struct_size(pkginfo[i], addrs, pkgcounts[i]), GFP_KERNEL);
> + if (!pkginfo[i])
> + return -ENOMEM;
> + pkginfo[i]->count = pkgcounts[i];
> + }
> +
> + /* Save MMIO address(es) for each telemetry region in per-package structures */
> + for (int i = 0; i < p->count; i++) {
> + tr = &p->regions[i];
> + if (skip_this_region(tr, e))
> + continue;
> + mmi = pkginfo[tr->plat_info.package_id];
> + mmi->addrs[--pkgcounts[tr->plat_info.package_id]] = tr->addr;
> + }
> + e->pkginfo = no_free_ptr(pkginfo);
> +
> + return 0;
> }
>
> DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *, \
> @@ -168,5 +230,6 @@ void __exit intel_aet_exit(void)
> intel_pmt_put_feature_group((*peg)->pfg);
> (*peg)->pfg = NULL;
> }
> + free_mmio_info((*peg)->pkginfo);
> }
> }
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-05-21 22:50 ` [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-06-04 4:06 ` Reinette Chatre
2025-06-07 0:54 ` Fenghua Yu
1 sibling, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 4:06 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> The L3 resource has several requirements for domains. There are structures
> that hold the 64-bit values of counters, and elements to keep track of
> the overflow and limbo threads.
>
> None of these are needed for the PERF_PKG resource. The hardware counters
> are wide enough that they do not wrap around for decades.
>
> Define a new rdt_perf_pkg_mon_domain structure which just consists of
> the standard rdt_domain_hdr to keep track of domain id and CPU mask.
>
> Change domain_add_cpu_mon(), domain_remove_cpu_mon(),
> resctrl_offline_mon_domain(), and resctrl_online_mon_domain() to check
> resource type and perform only the operations needed for domains in the
> PERF_PKG resource.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 41 ++++++++++++++++++++++++++++++
> fs/resctrl/rdtgroup.c | 4 +++
> 2 files changed, 45 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 64ce561e77a0..18d84c497ee4 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -540,6 +540,38 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
> }
> }
>
> +/**
> + * struct rdt_perf_pkg_mon_domain - CPUs sharing an Intel-PMT-scoped resctrl monitor resource
> + * @hdr: common header for different domain types
> + */
> +struct rdt_perf_pkg_mon_domain {
> + struct rdt_domain_hdr hdr;
> +};
> +
> +static void setup_intel_aet_mon_domain(int cpu, int id, struct rdt_resource *r,
> + struct list_head *add_pos)
> +{
> + struct rdt_perf_pkg_mon_domain *d;
> + int err;
> +
> + d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
> + if (!d)
> + return;
> +
> + d->hdr.id = id;
> + d->hdr.type = RESCTRL_MON_DOMAIN;
> + d->hdr.rid = r->rid;
> + cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + list_add_tail_rcu(&d->hdr.list, add_pos);
> +
> + err = resctrl_online_mon_domain(r, &d->hdr);
> + if (err) {
> + list_del_rcu(&d->hdr.list);
> + synchronize_rcu();
> + kfree(d);
> + }
> +}
> +
> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> @@ -567,6 +599,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> case RDT_RESOURCE_L3:
> l3_mon_domain_setup(cpu, id, r, add_pos);
> break;
> + case RDT_RESOURCE_PERF_PKG:
> + setup_intel_aet_mon_domain(cpu, id, r, add_pos);
> + break;
> default:
> WARN_ON_ONCE(1);
> }
> @@ -666,6 +701,12 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> default:
> pr_warn_once("Unknown resource rid=%d\n", r->rid);
> break;
Please keep "default" last.
> + case RDT_RESOURCE_PERF_PKG:
> + resctrl_offline_mon_domain(r, hdr);
> + list_del_rcu(&hdr->list);
> + synchronize_rcu();
> + kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr));
> + break;
> }
> }
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 1e1cc8001cbc..6078cdd5cad0 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -4170,6 +4170,8 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
> if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> goto out_unlock;
>
> + if (r->rid == RDT_RESOURCE_PERF_PKG)
> + goto do_mkdir;
Please move this "r->rid == RDT_RESOURCE_PERF_PKG" to be right after getting the mutex, there is
no reason to check the domain header for this resource. This enables the domain_header_is_valid()
check to use hardcoded RDT_RESOURCE_L3 as parameter to match the required L3 resource domain used
in container_of() below.
> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> err = domain_setup_l3_mon_state(r, d);
> if (err)
> @@ -4184,6 +4186,8 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
> INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
>
> +do_mkdir:
"do_mkdir" -> "mkdir"
> + err = 0;
> /*
> * If the filesystem is not mounted then only the default resource group
> * exists. Creation of its directories is deferred until mount time
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option
2025-05-21 22:50 ` [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
@ 2025-06-04 4:10 ` Reinette Chatre
2025-06-06 23:55 ` Fenghua Yu
1 sibling, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 4:10 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Users may want to force either of the telemetry features on
> (in the case where they are disabled due to erratum) or off
> (in the case that a limited number of RMIDs for a telemetry
> feature reduces the number of monitor groups that can be
> created.)
>
> Unlike other options that are tied to X86_FEATURE_* flags,
> these must be queried by name. Add a function to do that.
>
> Add checks for users who forced either feature off.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> .../admin-guide/kernel-parameters.txt | 2 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 4 +++
> arch/x86/kernel/cpu/resctrl/core.c | 28 +++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 6 ++++
> 4 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index d9fd26b95b34..4811bc812f0f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5988,7 +5988,7 @@
> rdt= [HW,X86,RDT]
> Turn on/off individual RDT features. List is:
> cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
> - mba, smba, bmec.
> + mba, smba, bmec, energy, perf.
> E.g. to turn on cmt and turn off mba use:
> rdt=cmt,!mba
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 42da0a222c7c..524f3c183900 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -167,6 +167,10 @@ void __init intel_rdt_mbm_apply_quirk(void);
>
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>
> +bool rdt_is_option_force_enabled(char *option);
> +
> +bool rdt_is_option_force_disabled(char *option);
> +
> bool intel_aet_get_events(void);
> void __exit intel_aet_exit(void);
> int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index f07f5b58639a..b23309566500 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -797,6 +797,8 @@ enum {
> RDT_FLAG_MBA,
> RDT_FLAG_SMBA,
> RDT_FLAG_BMEC,
> + RDT_FLAG_ENERGY,
> + RDT_FLAG_PERF,
> };
>
> #define RDT_OPT(idx, n, f) \
> @@ -822,6 +824,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
> RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
> RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
> RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
> + RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
> + RDT_OPT(RDT_FLAG_PERF, "perf", 0),
> };
> #define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
>
> @@ -871,6 +875,30 @@ bool rdt_cpu_has(int flag)
> return ret;
> }
>
> +bool rdt_is_option_force_enabled(char *name)
> +{
> + struct rdt_options *o;
> +
> + for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
> + if (!strcmp(name, o->name))
> + return o->force_on;
> + }
> +
> + return false;
> +}
> +
> +bool rdt_is_option_force_disabled(char *name)
> +{
> + struct rdt_options *o;
> +
> + for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
> + if (!strcmp(name, o->name))
> + return o->force_off;
> + }
> +
> + return false;
> +}
> +
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
> {
> if (!rdt_cpu_has(X86_FEATURE_BMEC))
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index be52c9302a80..c1fc85dbf0d8 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -51,6 +51,7 @@ struct pmt_event {
>
> /**
> * struct event_group - All information about a group of telemetry events.
> + * @name: Name for this group (used by boot rdt= option)
> * @pfg: Points to the aggregated telemetry space information
> * within the OOBMSM driver that contains data for all
> * telemetry regions.
> @@ -62,6 +63,7 @@ struct pmt_event {
> */
> struct event_group {
> /* Data fields used by this code. */
> + char *name;
> struct pmt_feature_group *pfg;
> struct mmio_info **pkginfo;
>
> @@ -77,6 +79,7 @@ struct event_group {
> * File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
> */
> static struct event_group energy_0x26696143 = {
> + .name = "energy",
> .guid = 0x26696143,
> .mmio_size = (576 * 2 + 3) * 8,
> .num_events = 2,
> @@ -91,6 +94,7 @@ static struct event_group energy_0x26696143 = {
> * File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
> */
> static struct event_group perf_0x26557651 = {
> + .name = "perf",
> .guid = 0x26557651,
> .mmio_size = (576 * 7 + 3) * 8,
> .num_events = 7,
> @@ -247,6 +251,8 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
> for (peg = &known_event_groups[0]; peg < &known_event_groups[NUM_KNOWN_GROUPS]; peg++) {
> for (int i = 0; i < p->count; i++) {
> if ((*peg)->guid == p->regions[i].guid) {
> + if (rdt_is_option_force_disabled((*peg)->name))
> + return false;
I do not see how this supports the "erratum" use case claimed in the changelog.
resctrl supports the "erratum" use case for hardware features: resctrl forces a
feature off, and that feature can be forced back on by a user via the rdt=
parameter. The changelog claims that this supports the same.
Consider the scenario when resctrl forces a feature off because of erratum,
for example, by doing something like __check_quirks_intel():
	if (/* test if quirk applies */)
		set_rdt_options("!perf");
The above will set "force_off" for the "perf" feature and no matter whether the
user boots with "rdt=perf", the above rdt_is_option_force_disabled() will
return true since it does not check "force_on", which is set via the kernel parameter.
Thus it is not possible for a user to enable a feature that is forced off
because of an erratum.
Note how rdt_cpu_has() deals with this:
	if (o->force_off)
		ret = false;
	if (o->force_on)
		ret = true;
rdt_cpu_has() deals with the "erratum" use case by first checking "force_off",
which captures the disable due to an erratum, but then allows that to be overridden
by checking "force_on" next and changing the setting.
> ret = configure_events(*peg, p);
> if (!ret) {
> (*peg)->pfg = no_free_ptr(p);
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 25/29] x86/resctrl: Handle number of RMIDs supported by telemetry resources
2025-05-21 22:50 ` [PATCH v5 25/29] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
@ 2025-06-04 4:13 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 4:13 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> There are now three meanings for "number of RMIDs":
>
> 1) The number for legacy features enumerated by CPUID leaf 0xF. This
> is the maximum number of distinct values that can be loaded into the
> IA32_PQR_ASSOC MSR. Note that systems with Sub-NUMA Cluster mode enabled
> will force scaling down the CPUID enumerated value by the number of SNC
> nodes per L3-cache.
>
> 2) The number of registers in MMIO space for each event. This
> is enumerated in the XML files and is the value initialized into
> event_group::num_rmids. This will be overwritten with a lower
> value if hardware does not support all these registers at the
> same time (see next case).
>
> 3) The number of "h/w counters" (this isn't a strictly accurate
> description of how things work, but serves as a useful analogy that
> does describe the limitations) feeding to those MMIO registers. This
> is enumerated in telemetry_region::num_rmids returned from the call to
> intel_pmt_get_regions_by_feature()
>
> Event groups with insufficient "h/w counter" to track all RMIDs are
I'd like to highlight that the above sentence follows a section with
heading "There are now three meanings for "number of RMIDs":" ... following
such a section with a sentence that then refers to "all RMIDs" is bound
to make the reader wonder which of the three RMIDs is being talked about.
Perhaps something like "Event groups with insufficient "h/w counter" to
track all values that can be loaded into the IA32_PQR_ASSOC MSR ..."
> difficult for users to use, since the system may reassign "h/w counters"
> as any time. This means that users cannot reliably collect two consecutive
> event counts to compute the rate at which events are occurring.
>
> Ignore such under-resourced event groups unless the user explicitly
> requests to enable them using the "rdt=" Linux boot argument.
>
> Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG
> resource "num_rmids" value to the smallest of these values to ensure
> that all resctrl groups have equal monitor capabilities.
>
> N.B. Changed type of rdt_resource::num_rmids to u32 to match.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 2 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 27 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 ++
> 4 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 4ba51cb598e1..b7e15abcde23 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -286,7 +286,7 @@ struct rdt_resource {
> int rid;
> bool alloc_capable;
> bool mon_capable;
> - int num_rmid;
> + u32 num_rmid;
> enum resctrl_scope ctrl_scope;
> enum resctrl_scope mon_scope;
> struct resctrl_cache cache;
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 524f3c183900..795534b9b9d2 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -18,6 +18,8 @@
>
> #define RMID_VAL_UNAVAIL BIT_ULL(62)
>
> +extern int rdt_num_system_rmids;
> +
> /*
> * With the above fields in use 62 bits remain in MSR_IA32_QM_CTR for
> * data to be returned. The counter width is discovered from the hardware
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index c1fc85dbf0d8..1b41167ad976 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -14,6 +14,7 @@
> #include <linux/cleanup.h>
> #include <linux/cpu.h>
> #include <linux/io.h>
> +#include <linux/minmax.h>
> #include <linux/resctrl.h>
> #include <linux/slab.h>
>
> @@ -57,6 +58,9 @@ struct pmt_event {
> * telemetry regions.
> * @pkginfo: Per-package MMIO addresses of telemetry regions belonging to this group
> * @guid: Unique number per XML description file.
> + * @num_rmids: Number of RMIDS supported by this group. Will be adjusted downwards
"Will be adjusted" -> "Adjusted"
> + * if enumeration from intel_pmt_get_regions_by_feature() indicates
> + * fewer RMIDs can be tracked simultaneously.
> * @mmio_size: Number of bytes of MMIO registers for this group.
> * @num_events: Number of events in this group.
> * @evts: Array of event descriptors.
> @@ -69,6 +73,7 @@ struct event_group {
>
> /* Remaining fields initialized from XML file. */
> u32 guid;
> + u32 num_rmids;
> size_t mmio_size;
> int num_events;
> struct pmt_event evts[] __counted_by(num_events);
> @@ -81,6 +86,7 @@ struct event_group {
> static struct event_group energy_0x26696143 = {
> .name = "energy",
> .guid = 0x26696143,
> + .num_rmids = 576,
> .mmio_size = (576 * 2 + 3) * 8,
> .num_events = 2,
> .evts = {
> @@ -96,6 +102,7 @@ static struct event_group energy_0x26696143 = {
> static struct event_group perf_0x26557651 = {
> .name = "perf",
> .guid = 0x26557651,
> + .num_rmids = 576,
> .mmio_size = (576 * 7 + 3) * 8,
> .num_events = 7,
> .evts = {
> @@ -253,6 +260,15 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
> if ((*peg)->guid == p->regions[i].guid) {
> if (rdt_is_option_force_disabled((*peg)->name))
> return false;
> + /*
> + * Ignore event group with insufficient RMIDs unless the
"insufficient RMIDs" -> "fewer RMIDs than can be loaded into the IA32_PQR_ASSOC MSR"?
Please feel free to improve.
> + * user used the rdt= boot option to specifically ask
> + * for it to be enabled.
> + */
> + if (p->regions[i].num_rmids < rdt_num_system_rmids &&
> + !rdt_is_option_force_enabled((*peg)->name))
> + return false;
> + (*peg)->num_rmids = min((*peg)->num_rmids, p->regions[i].num_rmids);
> ret = configure_events(*peg, p);
> if (!ret) {
> (*peg)->pfg = no_free_ptr(p);
> @@ -272,11 +288,22 @@ static bool get_pmt_feature(enum pmt_feature_id feature)
> */
> bool intel_aet_get_events(void)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> + struct event_group **eg;
> bool ret1, ret2;
>
> ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM);
> ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM);
>
> + for (eg = &known_event_groups[0]; eg < &known_event_groups[NUM_KNOWN_GROUPS]; eg++) {
> + if (!(*eg)->pfg)
> + continue;
> + if (r->num_rmid)
> + r->num_rmid = min(r->num_rmid, (*eg)->num_rmids);
> + else
> + r->num_rmid = (*eg)->num_rmids;
> + }
> +
> return ret1 || ret2;
> }
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c99aa9dacfd8..9cd37be262a2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -32,6 +32,7 @@ bool rdt_mon_capable;
>
> #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
>
> +int rdt_num_system_rmids;
> static int snc_nodes_per_l3_cache = 1;
>
> /*
> @@ -350,6 +351,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> + rdt_num_system_rmids = r->num_rmid;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-05-21 22:50 ` [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file Tony Luck
@ 2025-06-04 4:15 ` Reinette Chatre
2025-06-06 0:09 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 4:15 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 5/21/25 3:50 PM, Tony Luck wrote:
> Creation of all files in the resctrl file system is under control of
> the file system layer.
>
> But some resources may need to add a file to the info/{resource}
> directory for debug purposes.
>
> Add a new rdt_resource::info_file field for the resource to specify
> show() and/or write() operations. These will be called with the
> rdtgroup_mutex held.
>
> Architecture can note the file is only for debug using by setting
> the rftype::flags RFTYPE_DEBUG bit.
This needs to change. This punches a crater through the separation
between fs and arch that we worked hard to achieve. Please make an attempt
to preserve that separation, as I am sure you can.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 01/29] x86,fs/resctrl: Consolidate monitor event descriptions
2025-06-04 3:25 ` Reinette Chatre
@ 2025-06-04 16:33 ` Luck, Tony
2025-06-04 18:24 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-04 16:33 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Tue, Jun 03, 2025 at 08:25:56PM -0700, Reinette Chatre wrote:
> Hi Tony,
> > +void resctrl_enable_mon_event(enum resctrl_event_id evtid);
> > +
> > bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
>
> nit: When code is consistent in name use it is easier to read.
> Above there is already resctrl_arch_is_evt_configurable() that uses "evt"
> as parameter name so naming the new parameter "evt" instead of "evtid"
> will be much easier on the eye to make clear that this is the "same thing".
> Also later, when resctrl_is_mbm_event() is moved it will be clean to have
> it also use "evt" as parameter name and not end up with three different
> "evtid", "evt", and "e" for these related functions.
Should I also clean up existing muddled naming? Upstream has the
following names for parameters and local variables of type enum
resctrl_event_id (counts are the number of occurrences of each):
6 eventid
2 evt
1 evt_id
3 evtid
2 mba_mbps_default_event
1 mba_mbps_event
It seems that "eventid" is the most popular of existing uses.
Also seems the most descriptive.
Perhaps "mevt" would be a good standard choice for "struct mon_evt *mevt"?
Upstream uses this three times, but I add some extra using "*evt".
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 01/29] x86,fs/resctrl: Consolidate monitor event descriptions
2025-06-04 16:33 ` Luck, Tony
@ 2025-06-04 18:24 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 18:24 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 6/4/25 9:33 AM, Luck, Tony wrote:
> On Tue, Jun 03, 2025 at 08:25:56PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>> +void resctrl_enable_mon_event(enum resctrl_event_id evtid);
>>> +
>>> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
>>
>> nit: When code is consistent in name use it is easier to read.
>> Above there is already resctrl_arch_is_evt_configurable() that uses "evt"
>> as parameter name so naming the new parameter "evt" instead of "evtid"
>> will be much easier on the eye to make clear that this is the "same thing".
>> Also later, when resctrl_is_mbm_event() is moved it will be clean to have
>> it also use "evt" as parameter name and not end up with three different
>> "evtid", "evt", and "e" for these related functions.
>
> Should I also clean up existing muddled naming?
I think that matching code in the same area and not making things messier is
within the scope of this series. Cleaning up resctrl's variable names is not related
to this work. If you instead find that doing this cleanup simplifies your
contribution then you are welcome to add such changes.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 06/29] x86,fs/resctrl: Improve domain type checking
2025-06-04 3:31 ` Reinette Chatre
@ 2025-06-04 22:58 ` Luck, Tony
2025-06-04 23:40 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-04 22:58 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Tue, Jun 03, 2025 at 08:31:07PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/21/25 3:50 PM, Tony Luck wrote:
> > The rdt_domain_hdr structure is used in both control and monitor
> > domain structures to provide common methods for operations such as
> > adding a CPU to a domain, removing a CPU from a domain, accessing
> > the mask of all CPUs in a domain.
> >
> > The "type" field provides a simple check whether a domain is a
> > control or monitor domain so that programming errors operating
> > on domains will be quickly caught.
> >
> > To prepare for additional domain types that depend on the rdt_resource
> > to which they are connected add the resource id into the header
> > and check that in addition to the type.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > include/linux/resctrl.h | 9 +++++++++
> > arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++----
> > fs/resctrl/ctrlmondata.c | 2 +-
> > 3 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 40f2d0d48d02..d6b09952ef92 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -131,15 +131,24 @@ enum resctrl_domain_type {
> > * @list: all instances of this resource
> > * @id: unique id for this instance
> > * @type: type of this instance
> > + * @rid: index of resource for this domain
> > * @cpu_mask: which CPUs share this resource
> > */
> > struct rdt_domain_hdr {
> > struct list_head list;
> > int id;
> > enum resctrl_domain_type type;
> > + enum resctrl_res_level rid;
> > struct cpumask cpu_mask;
> > };
> >
> > +static inline bool domain_header_is_valid(struct rdt_domain_hdr *hdr,
> > + enum resctrl_domain_type type,
> > + enum resctrl_res_level rid)
> > +{
> > + return !WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
> > +}
> > +
> > /**
> > * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
> > * @hdr: common header for different domain types
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 4403a820db12..4983f6f81218 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -456,7 +456,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> >
> > hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
> > if (hdr) {
> > - if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
> > + if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> > return;
This type check was added as part of the split of the rdt_domain
structure into separate ctrl and mon structures. I think the concern
was that some code might look at the wrong rdt_resource list and
try to operate on a ctrl domain structure that is actually a mon
structure (or vice versa). This felt like a real possibility.
Extending this to save and check the resource id seemed like a
natural extension at the time. But I'm starting to doubt the value
of doing so.
For this new check to ever fail we would have to somehow add
a domain for some resource type to a list on a different
rdt_resource structure. I'm struggling to see how such an
error could ever occur. Domains are only added to an rdt_resource
list by one of domain_add_cpu_ctrl() or domain_add_cpu_mon().
But these same functions are the ones to allocate the domain
structure and initialize the "d->hdr.id" field a dozen or so
lines earlier in the function.
Note that I'm not disputing your comments where my patches
are still passing a rdt_l3_mon_domain structure down through
several levels of function calls only to do:
if (r->rid == RDT_RESOURCE_PERF_PKG)
return intel_aet_read_event(d->hdr.id, rmid, eventid, val);
revealing that it wasn't an rdt_l3_mon_domain at all!
But these domain_header_is_valid() checks didn't help uncover
that.
Bottom line: I'd like to just keep the "type" check and not
extend to check the resource id.
> > d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> >
>
> This is quite subtle and not obvious until a few patches later that the
> domain_header_is_valid() is done in preparation for using the
> rdt_domain_hdr::rid to verify that the correct containing structure is
> obtained in a subsequent container_of() call.
>
> Patch #10 mentions it explicitly: "Add sanity checks where
> container_of() is used to find the surrounding domain structure that
> hdr has the expected type."
>
> The change above, when combined with later changes, results in
> code like:
>
> if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> /* handle failure */
>
> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> ...
>
> Considering this all I do not think using a variable r->rid is appropriate
> here. Specifically, if the code has it hardcoded that, for example,
> the containing structure is "struct rdt_l3_mon_domain" then should the
> test not similarly be hardcoded to ensure that rid is RDT_RESOURCE_L3?
>
> Reinette
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 06/29] x86,fs/resctrl: Improve domain type checking
2025-06-04 22:58 ` Luck, Tony
@ 2025-06-04 23:40 ` Reinette Chatre
0 siblings, 0 replies; 90+ messages in thread
From: Reinette Chatre @ 2025-06-04 23:40 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 6/4/25 3:58 PM, Luck, Tony wrote:
> On Tue, Jun 03, 2025 at 08:31:07PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/21/25 3:50 PM, Tony Luck wrote:
>>> The rdt_domain_hdr structure is used in both control and monitor
>>> domain structures to provide common methods for operations such as
>>> adding a CPU to a domain, removing a CPU from a domain, accessing
>>> the mask of all CPUs in a domain.
>>>
>>> The "type" field provides a simple check whether a domain is a
>>> control or monitor domain so that programming errors operating
>>> on domains will be quickly caught.
>>>
>>> To prepare for additional domain types that depend on the rdt_resource
>>> to which they are connected add the resource id into the header
>>> and check that in addition to the type.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>> include/linux/resctrl.h | 9 +++++++++
>>> arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++----
>>> fs/resctrl/ctrlmondata.c | 2 +-
>>> 3 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 40f2d0d48d02..d6b09952ef92 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -131,15 +131,24 @@ enum resctrl_domain_type {
>>> * @list: all instances of this resource
>>> * @id: unique id for this instance
>>> * @type: type of this instance
>>> + * @rid: index of resource for this domain
>>> * @cpu_mask: which CPUs share this resource
>>> */
>>> struct rdt_domain_hdr {
>>> struct list_head list;
>>> int id;
>>> enum resctrl_domain_type type;
>>> + enum resctrl_res_level rid;
>>> struct cpumask cpu_mask;
>>> };
>>>
>>> +static inline bool domain_header_is_valid(struct rdt_domain_hdr *hdr,
>>> + enum resctrl_domain_type type,
>>> + enum resctrl_res_level rid)
>>> +{
>>> + return !WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
>>> +}
>>> +
>>> /**
>>> * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
>>> * @hdr: common header for different domain types
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index 4403a820db12..4983f6f81218 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -456,7 +456,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
>>>
>>> hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
>>> if (hdr) {
>>> - if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
>>> + if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
>>> return;
>
> This type check was added as part of the split of the rdt_domain
> structure into sepaarte ctrl and mon structures. I think the concern
> was that some code might look at the wrong rdt_resource list and
> try to operate on a ctrl domain structure that is actually a mon
> structure (or vice versa). This felt like a real possibility.
>
> Extending this to save and check the resource id seemed like a
> natural extension at the time. But I'm starting to doubt the value
> of doing so.
>
> For this new check to ever fail we would have to somehow add
> a domain for some resource type to a list on a different
> rdt_resource structure. I'm struggling to see how such an
I disagree with this statement. I do not see the failure as related
to the list to which the domain belongs but instead related to how
a function interprets a domain passed to it. There are a couple of functions
that are provided a domain pointer where the function is hardcoded to expect
the domain pointed to to be of a specific type.
For example, rdtgroup_mondata_show() is hardcoded to work with an
L3 resource domain. If my suggestion here is followed then
rdtgroup_mondata_show() would contain the specific check below because
it interprets the domain as belonging to a L3 resource:
domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)
With the check used as above the current issue would be exposed.
> error could ever occur. Domains are only added to an rdt_resource
> list by one of domain_add_cpu_ctrl() or domain_add_cpu_mon().
> But these same functions are the ones to allocate the domain
> structure and initialize the "d->hdr.id" field a dozen or so
> lines earlier in the function.
>
> Note that I'm not disputing your comments where my patches
> are still passing a rdt_l3_mon_domain structure down through
> several levels of function calls only to do:
>
> if (r->rid == RDT_RESOURCE_PERF_PKG)
> return intel_aet_read_event(d->hdr.id, rmid, eventid, val);
>
> revealing that it wasn't an rdt_l3_mon_domain at all!
>
> But these domain_header_is_valid() checks didn't help uncover
> that.
This is not because of the check itself but how it is used in this version
... it essentially gave the check the wrong value to check against.
> Bottom line: I'd like to just keep the "type" check and not
> extend to check the resource id.
Pointers to domains of different types are passed around (irrespective
of the list they belong to) but are required to be of a particular type
when acted on. The way I see it this check is required if this
design continues. If used correctly in this implementation it will help
to expose those places where L3-specific functions are used as
"generic" to operate on PERF_PKG domains.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-04 4:15 ` Reinette Chatre
@ 2025-06-06 0:09 ` Luck, Tony
2025-06-06 16:26 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-06 0:09 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Tue, Jun 03, 2025 at 09:15:02PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/21/25 3:50 PM, Tony Luck wrote:
> > Creation of all files in the resctrl file system is under control of
> > the file system layer.
> >
> > But some resources may need to add a file to the info/{resource}
> > directory for debug purposes.
> >
> > Add a new rdt_resource::info_file field for the resource to specify
> > show() and/or write() operations. These will be called with the
> > rdtgroup_mutex held.
> >
> > Architecture can note the file is only for debug using by setting
> > the rftype::flags RFTYPE_DEBUG bit.
>
> This needs to change. This punches a crater through the separation
> between fs and arch that we worked hard to achieve. Please make an attempt
> to do so as I am sure you can.
The file I want to create here may only be of interest in debugging
the telemetry h/w interface. So my next choice is debugfs.
But creation of the debugfs "resctrl" directory is done by file
system code and the debugfs_resctrl variable is only marked "extern" by
fs/resctrl/internal.h, so it is currently not accessible to architecture code.
Is that a deliberate choice? Would it be OK to make that visible to
architecture code to create files in /sys/kernel/debug/resctrl?
Or should I add my file in a new /sys/kernel/debug/x86/resctrl
directory?
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters
2025-06-04 3:49 ` Reinette Chatre
@ 2025-06-06 16:25 ` Luck, Tony
2025-06-06 16:56 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-06 16:25 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Tue, Jun 03, 2025 at 08:49:08PM -0700, Reinette Chatre wrote:
> > + sprintf(buf, "%0*llu", fp->decplaces, frac);
>
> I'm a bit confused here. I see fp->decplaces as the field width and the "0" indicates
> that the value is zero padded on the _left_. I interpret this to mean that, for example,
> if the value of frac is 42 then it will be printed as "0042". The fraction's value is modified
> (it is printed as "0.0042") and there are no trailing zeroes to remove. What am I missing?
An example may help. Suppose the architecture is providing 18-binary-place
numbers, and delivers the value 0x60000 to be displayed. With 18 binary
places the filesystem chooses 6 decimal places (I'll document the rationale
for this choice in comments in the next version). In binary the value looks
like this:

	integer   binary_places
	1         100000000000000000
Running through my algorithm will end with "frac" = 500000 (decimal).
Thus there are *trailing* zeroes. The value should be displayed as
"1.5" not as "1.500000".
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-06 0:09 ` Luck, Tony
@ 2025-06-06 16:26 ` Reinette Chatre
2025-06-06 17:30 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-06 16:26 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 6/5/25 5:09 PM, Luck, Tony wrote:
> On Tue, Jun 03, 2025 at 09:15:02PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/21/25 3:50 PM, Tony Luck wrote:
>>> Creation of all files in the resctrl file system is under control of
>>> the file system layer.
>>>
>>> But some resources may need to add a file to the info/{resource}
>>> directory for debug purposes.
>>>
>>> Add a new rdt_resource::info_file field for the resource to specify
>>> show() and/or write() operations. These will be called with the
>>> rdtgroup_mutex held.
>>>
>>> Architecture can note the file is only for debug using by setting
>>> the rftype::flags RFTYPE_DEBUG bit.
>>
>> This needs to change. This punches a crater through the separation
>> between fs and arch that we worked hard to achieve. Please make an attempt
>> to do so as I am sure you can.
>
> The file I want to create here may only be of interest in debugging
> the telemetry h/w interface. So my next choice is debugfs.
I believe we can make either work but debugfs does indeed sound most
appropriate. I am not familiar with customs surrounding kernel support for
hardware debug interfaces though, so setting that aside for now.
>
> But creation of the debugfs "resctrl" directory is done by file
> system code and the debugfs_resctrl variable is only marked "extern" by
> fs/resctrl/internal.h, so currently not accessible to architecture code.
It would be best to stay this way.
>
> Is that a deliberate choice? Would it be OK to make that visible to
> architecture code to create files in /sys/kernel/debug/resctrl?
It should not be necessary to give an arch total control over
resctrl fs's debugfs.
>
> Or should I add my file in a new /sys/kernel/debug/x86/resctrl
> directory?
Since there is already a resctrl debugfs I think it will be less confusing
to keep all resctrl debug together within /sys/kernel/debug/resctrl.
There should be some bounds on what an arch can do here though.
There is already support for debugfs via the pseudo-locking
work where /sys/kernel/debug/resctrl contains directories named after
resource groups to provide debug for a specific resource group. Giving
an arch total freedom on what can be created in /sys/kernel/debug/resctrl
thus runs the risk of corner cases where the name of an arch
debug entry collides with resource group names, and with the ordering of these
directory creations it can become tricky.
With /sys/kernel/debug/resctrl potentially mirroring /sys/fs/resctrl to
support various debugging scenarios there may later be resource level
debugging for which a "/sys/kernel/debug/resctrl/info/<resource>/<debugfile>" can
be used. Considering this it looks to me as though one possible boundary could
be to isolate arch specific debug to, for example, a new directory named
"/sys/kernel/debug/resctrl/info/arch_debug_name_tbd/". Placing the
arch debug in a sub-directory of "info" avoids collision with resource
group names, and a name that also avoids collision with resource names can be
chosen since all these names are controlled by resctrl fs.
To support this an architecture can request resctrl fs to create such a directory
after resctrl_init() succeeds. Since it is customary to ignore errors for
debugfs dir creation the call is not expected to interfere with initialization,
and existing arch initialization should not be impacted by this call
being after resctrl_init().
For example, for x86 it could be something like:

	struct dentry *arch_priv_debug_fs_dir;

	resctrl_arch_late_init(void) {
		...
		ret = resctrl_init();
		if (ret) {
			cpuhp_remove_state(state);
			return ret;
		}

		rdt_online = state;

		arch_priv_debug_fs_dir = resctrl_create_arch_debugfs(); /* all names up for improvement */
		...
	}
Note that the architecture does not pick the directory name.
On the resctrl fs side, resctrl_create_arch_debugfs() creates
"/sys/kernel/debug/resctrl/info/arch_debug_name_tbd" and passes the
dentry back to the arch, which gives the arch flexibility to create
new directories and files to match its debug requirements.
The arch has flexibility to manage the lifetimes of files/directories of
its debugfs area while resctrl fs still has final control
via the debugfs_remove_recursive() of the top level /sys/kernel/debug/resctrl
on its exit.
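For completeness, a rough sketch of what the resctrl fs side could look like; the
function name and the "arch_debug_name_tbd" directory are only the placeholders
from this discussion, not a settled API, and this fragment of course only makes
sense in kernel context:

	struct dentry *resctrl_create_arch_debugfs(void)
	{
		struct dentry *info;

		/* "info" keeps arch debug clear of resource group directory names */
		info = debugfs_create_dir("info", debugfs_resctrl);

		return debugfs_create_dir("arch_debug_name_tbd", info);
	}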
What do you think?
It will need more work to support an arch-specific debug that is
connected to a resource or resource group, but this does not seem to be the
goal here. Even so, a design like this keeps the door open for
future resctrl fs and/or arch debug associated with resources and
resource groups.
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters
2025-06-06 16:25 ` Luck, Tony
@ 2025-06-06 16:56 ` Reinette Chatre
2025-06-10 15:16 ` Dave Martin
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-06 16:56 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 6/6/25 9:25 AM, Luck, Tony wrote:
> On Tue, Jun 03, 2025 at 08:49:08PM -0700, Reinette Chatre wrote:
>>> + sprintf(buf, "%0*llu", fp->decplaces, frac);
>>
>> I'm a bit confused here. I see fp->decplaces as the field width and the "0" indicates
>> that the value is zero padded on the _left_. I interpret this to mean that, for example,
>> if the value of frac is 42 then it will be printed as "0042". The fraction's value is modified
>> (it is printed as "0.0042") and there are no trailing zeroes to remove. What am I missing?
>
> An example may help. Suppose architecture is providing 18 binary place
> numbers, and delivers the value 0x60000 to be displayed. With 18 binary
> places filesystem chooses 6 decimal places (I'll document the rationale
> for this choice in comments in next version). In binary the value looks
> like this:
>
> integer binary_places
> 1 100000000000000000
>
> Running through my algorithm will end with "frac" = 500000 (decimal).
>
> Thus there are *trailing* zeroes. The value should be displayed as
> "1.5" not as "1.500000".
Instead of a counter example, could you please make it obvious through
the algorithm description and/or explanation of decimal place choice how
"frac" is guaranteed to never be smaller than "decplaces"?
Reinette
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-06 16:26 ` Reinette Chatre
@ 2025-06-06 17:30 ` Luck, Tony
2025-06-06 21:14 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-06 17:30 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 09:26:06AM -0700, Reinette Chatre wrote:
> With /sys/kernel/debug/resctrl potentially mirroring /sys/fs/resctrl to
> support various debugging scenarios there may later be resource level
> debugging for which a "/sys/kernel/debug/resctrl/info/<resource>/<debugfile>" can
> be used. Considering this it looks to me as though one possible boundary could
> be to isolate arch specific debug to, for example, a new directory named
> "/sys/kernel/debug/resctrl/info/arch_debug_name_tbd/". By placing the
> arch debug in a sub-directory named "info" it avoids collision with resource
> group names with naming that also avoids collision with resource names since
> all these names are controlled by resctrl fs.
That seems like a good path. PoC patch below. Note that I put the dentry
for the debug info directory into struct rdt_resource, so no call from
architecture to file system code is needed to access it.
Directory layout looks like this:
# tree /sys/kernel/debug/resctrl/
/sys/kernel/debug/resctrl/
└── info
├── L2
├── L3
├── MB
└── SMBA
6 directories, 0 files
-Tony
---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 5e28e81b35f6..78dd0f8f7ad8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -281,6 +281,7 @@ enum resctrl_schema_fmt {
* @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth
* monitoring events can be configured.
* @cdp_capable: Is the CDP feature available on this resource
+ * @arch_debug_info: Debugfs info directory for architecture use
*/
struct rdt_resource {
int rid;
@@ -297,6 +298,7 @@ struct rdt_resource {
enum resctrl_schema_fmt schema_fmt;
unsigned int mbm_cfg_mask;
bool cdp_capable;
+ struct dentry *arch_debug_info;
};
/*
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index ed4fc45da346..48c587201fb6 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4274,6 +4274,8 @@ void resctrl_offline_cpu(unsigned int cpu)
*/
int resctrl_init(void)
{
+ struct dentry *debuginfodir;
+ struct rdt_resource *r;
int ret = 0;
seq_buf_init(&last_cmd_status, last_cmd_status_buf,
@@ -4320,6 +4322,12 @@ int resctrl_init(void)
*/
debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
+ /* Create debug info directories for each resource */
+ debuginfodir = debugfs_create_dir("info", debugfs_resctrl);
+
+ for_each_rdt_resource(r)
+ r->arch_debug_info = debugfs_create_dir(r->name, debuginfodir);
+
return 0;
cleanup_mountpoint:
^ permalink raw reply related [flat|nested] 90+ messages in thread
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-06 17:30 ` Luck, Tony
@ 2025-06-06 21:14 ` Reinette Chatre
2025-06-09 18:49 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-06 21:14 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Hi Tony,
On 6/6/25 10:30 AM, Luck, Tony wrote:
> On Fri, Jun 06, 2025 at 09:26:06AM -0700, Reinette Chatre wrote:
>> With /sys/kernel/debug/resctrl potentially mirroring /sys/fs/resctrl to
>> support various debugging scenarios there may later be resource level
>> debugging for which a "/sys/kernel/debug/resctrl/info/<resource>/<debugfile>" can
>> be used. Considering this it looks to me as though one possible boundary could
>> be to isolate arch specific debug to, for example, a new directory named
>> "/sys/kernel/debug/resctrl/info/arch_debug_name_tbd/". By placing the
>> arch debug in a sub-directory named "info" it avoids collision with resource
>> group names with naming that also avoids collision with resource names since
>> all these names are controlled by resctrl fs.
>
>
> That seems like a good path. PoC patch below. Note that I put the dentry
> for the debug info directory into struct rdt_resource. So no call from
> architecture to file system code needed to access.
ok, reading between the lines there is now a switch to per-resource
requirement, which fits with the use.
>
> Directory layout looks like this:
>
> # tree /sys/kernel/debug/resctrl/
> /sys/kernel/debug/resctrl/
> └── info
> ├── L2
> ├── L3
> ├── MB
> └── SMBA
>
This looks like something that needs to be owned and managed by
resctrl fs (more below).
> 6 directories, 0 files
>
> -Tony
>
> ---
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 5e28e81b35f6..78dd0f8f7ad8 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -281,6 +281,7 @@ enum resctrl_schema_fmt {
> * @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth
> * monitoring events can be configured.
> * @cdp_capable: Is the CDP feature available on this resource
> + * @arch_debug_info: Debugfs info directory for architecture use
> */
> struct rdt_resource {
> int rid;
> @@ -297,6 +298,7 @@ struct rdt_resource {
> enum resctrl_schema_fmt schema_fmt;
> unsigned int mbm_cfg_mask;
> bool cdp_capable;
> + struct dentry *arch_debug_info;
> };
ok ... but maybe not quite exactly (more below)
>
> /*
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index ed4fc45da346..48c587201fb6 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -4274,6 +4274,8 @@ void resctrl_offline_cpu(unsigned int cpu)
> */
> int resctrl_init(void)
> {
> + struct dentry *debuginfodir;
> + struct rdt_resource *r;
> int ret = 0;
>
> seq_buf_init(&last_cmd_status, last_cmd_status_buf,
> @@ -4320,6 +4322,12 @@ int resctrl_init(void)
> */
> debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
>
> + /* Create debug info directories for each resource */
> + debuginfodir = debugfs_create_dir("info", debugfs_resctrl);
> +
> + for_each_rdt_resource(r)
> + r->arch_debug_info = debugfs_create_dir(r->name, debuginfodir);
This ignores (*) several of the boundaries my response aimed to establish.
Here are some red flags:
- This creates the resource named directory and hands off that pointer to the
arch. As I mentioned the arch should not have control over resctrl's debugfs.
I believe this is the type of information that should be in control of resctrl fs
since, as I mentioned, resctrl fs may need to add debugging that mirrors /sys/fs/resctrl.
- Blindly creating these directories (a) without the resource even existing on the
system, and (b) without being used/requested by the architecture does not create a good
interface in my opinion. User space will see a bunch of empty directories
associated with resources that are not present on the system.
- The directories created do not even match /sys/fs/resctrl/info when it comes
to the resources. Note that the directories within /sys/fs/resctrl/info are created
from the schema for control resources and appends _MON to monitor resources. Like
I mentioned in my earlier response there should ideally be space for a future
resctrl fs extension to mirror layout of /sys/fs/resctrl for resctrl fs debug
in debugfs. This solution ignores all of that.
I still think that the architecture should request the debugfs directory from resctrl fs.
This avoids resctrl fs needing to create directories/files that are never used and
does not present user space with an empty tree. Considering that the new PERF_PKG
resource may not come online until resctrl mount this should be something that can be
called at any time.
One possibility, that supports intended use while keeping the door open to support
future resctrl fs use of the debugfs, could be a new resctrl fs function,
for example resctrl_create_mon_resource_debugfs(struct rdt_resource *r), that will initialize
rdt_resource::arch_debug_info(*) to point to the dentry of newly created
/sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON/arch_debug_name_TBD *if*
the associated resource is capable of monitoring ... or do you think an architecture
may want to add debugging information before a resource is discovered/enabled?
If doing this then rdt_resource::arch_debug_info is no longer appropriate since it needs
to be specific to the monitoring resource. Perhaps then rdt_resource::arch_mon_debugfs
that would eventually live in [1]?
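For illustration only, a rough sketch of the kind of helper described above. It
assumes resctrl fs creates /sys/kernel/debug/resctrl/info once at init (dentry
held in a hypothetical debugfs_resctrl_info) and that rdt_resource gains the
proposed arch_mon_debugfs pointer; "arch_debug_name_tbd" remains a placeholder
name from this thread, not an agreed interface:

	void resctrl_create_mon_resource_debugfs(struct rdt_resource *r)
	{
		struct dentry *res_dir;
		char name[32];

		/* Only monitor-capable resources get a debug directory */
		if (!r->mon_capable)
			return;

		/* Match /sys/fs/resctrl/info naming: monitor resources get "_MON" */
		snprintf(name, sizeof(name), "%s_MON", r->name);
		res_dir = debugfs_create_dir(name, debugfs_resctrl_info);

		/* Arch-specific debug files live below a directory owned by resctrl fs */
		r->arch_mon_debugfs = debugfs_create_dir("arch_debug_name_tbd", res_dir);
	}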
This is feeling rushed and I am sharing some top of mind ideas. I will give this
more thought.
Reinette
[1] https://lore.kernel.org/lkml/cb8425c73f57280b0b4f22e089b2912eede42f7a.1747349530.git.babu.moger@amd.com/
(*) I have now asked several times to stop ignoring feedback. This should not even
be necessary in the first place. I do not require you to agree with me and I do not claim
to always be right; please just stop ignoring feedback. Going forward I plan to ignore
messages that ignore feedback.
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option
2025-05-21 22:50 ` [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
2025-06-04 4:10 ` Reinette Chatre
@ 2025-06-06 23:55 ` Fenghua Yu
2025-06-08 21:52 ` Luck, Tony
1 sibling, 1 reply; 90+ messages in thread
From: Fenghua Yu @ 2025-06-06 23:55 UTC (permalink / raw)
To: Tony Luck, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi, Tony,
On 5/21/25 15:50, Tony Luck wrote:
> Users may want to force either of the telemetry features on
> (in the case where they are disabled due to erratum) or off
> (in the case that a limited number of RMIDs for a telemetry
> feature reduces the number of monitor groups that can be
> created.)
>
> Unlike other options that are tied to X86_FEATURE_* flags,
> these must be queried by name. Add a function to do that.
>
> Add checks for users who forced either feature off.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> .../admin-guide/kernel-parameters.txt | 2 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 4 +++
> arch/x86/kernel/cpu/resctrl/core.c | 28 +++++++++++++++++++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 6 ++++
> 4 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index d9fd26b95b34..4811bc812f0f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5988,7 +5988,7 @@
> rdt= [HW,X86,RDT]
> Turn on/off individual RDT features. List is:
> cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
> - mba, smba, bmec.
> + mba, smba, bmec, energy, perf.
> E.g. to turn on cmt and turn off mba use:
> rdt=cmt,!mba
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 42da0a222c7c..524f3c183900 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -167,6 +167,10 @@ void __init intel_rdt_mbm_apply_quirk(void);
>
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
>
> +bool rdt_is_option_force_enabled(char *option);
> +
> +bool rdt_is_option_force_disabled(char *option);
> +
> bool intel_aet_get_events(void);
> void __exit intel_aet_exit(void);
> int intel_aet_read_event(int domid, int rmid, enum resctrl_event_id evtid, u64 *val);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index f07f5b58639a..b23309566500 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -797,6 +797,8 @@ enum {
> RDT_FLAG_MBA,
> RDT_FLAG_SMBA,
> RDT_FLAG_BMEC,
> + RDT_FLAG_ENERGY,
> + RDT_FLAG_PERF,
> };
>
> #define RDT_OPT(idx, n, f) \
> @@ -822,6 +824,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
> RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
> RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
> RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
> + RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
> + RDT_OPT(RDT_FLAG_PERF, "perf", 0),
Boot options "energy" and "perf" are at PMT event group level. Other boot
options are at individual event level.
E.g. "!perf" forces off all 7 PMT PERF events.
E.g. the "uops retired" event has an erratum but all other PERF events work
fine. Disabling the "perf" group disables all PERF events. Is "!perf" a
useful boot option?
Is there any consideration to have boot options at individual PMT event level,
like the legacy events, instead of at PMT event group level?
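As background for the "energy"/"perf" options: the by-name query the commit
message adds (rdt_is_option_force_enabled(), implementation not included in the
quoted hunks) might look roughly like this, reusing the existing rdt_options[]
fields. This is an illustrative sketch only; details in the posted patch may differ.

	bool rdt_is_option_force_enabled(char *option)
	{
		struct rdt_options *o;

		/* Walk the option table and report whether the user forced it on */
		for (o = rdt_options; o < &rdt_options[ARRAY_SIZE(rdt_options)]; o++) {
			if (!strcmp(option, o->name))
				return o->force_on;
		}

		return false;
	}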
[SNIP]
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events
2025-05-21 22:50 ` [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events Tony Luck
2025-05-23 9:00 ` Peter Newman
2025-06-04 3:29 ` Reinette Chatre
@ 2025-06-07 0:45 ` Fenghua Yu
2025-06-08 21:59 ` Luck, Tony
2 siblings, 1 reply; 90+ messages in thread
From: Fenghua Yu @ 2025-06-07 0:45 UTC (permalink / raw)
To: Tony Luck, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi, Tony,
On 5/21/25 15:50, Tony Luck wrote:
> There's a rule in computer programming that objects appear zero,
> once, or many times. So code accordingly.
>
> There are two MBM events and resctrl is coded with a lot of
>
> if (local)
> do one thing
> if (total)
> do a different thing
>
> Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold arrays
> of pointers to per event data instead of explicit fields for total and
> local bandwidth.
>
> Simplify by coding for many events using loops on which are enabled.
>
> Move resctrl_is_mbm_event() to <linux/resctrl.h> so it can be used more
> widely. Also provide a for_each_mbm_event() helper macro.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 15 +++++---
> include/linux/resctrl_types.h | 3 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 6 ++--
> arch/x86/kernel/cpu/resctrl/core.c | 38 ++++++++++----------
> arch/x86/kernel/cpu/resctrl/monitor.c | 36 +++++++++----------
> fs/resctrl/monitor.c | 13 ++++---
> fs/resctrl/rdtgroup.c | 48 ++++++++++++--------------
> 7 files changed, 82 insertions(+), 77 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 843ad7c8e247..40f2d0d48d02 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -161,8 +161,7 @@ struct rdt_ctrl_domain {
> * @hdr: common header for different domain types
> * @ci: cache info for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> - * @mbm_total: saved state for MBM total bandwidth
> - * @mbm_local: saved state for MBM local bandwidth
> + * @mbm_states: saved state for each QOS MBM event
> * @mbm_over: worker to periodically read MBM h/w counters
> * @cqm_limbo: worker to periodically read CQM h/w counters
> * @mbm_work_cpu: worker CPU for MBM h/w counters
> @@ -172,8 +171,7 @@ struct rdt_mon_domain {
> struct rdt_domain_hdr hdr;
> struct cacheinfo *ci;
> unsigned long *rmid_busy_llc;
> - struct mbm_state *mbm_total;
> - struct mbm_state *mbm_local;
> + struct mbm_state *mbm_states[QOS_NUM_L3_MBM_EVENTS];
> struct delayed_work mbm_over;
> struct delayed_work cqm_limbo;
> int mbm_work_cpu;
> @@ -376,6 +374,15 @@ bool resctrl_is_mon_event_enabled(enum resctrl_event_id evt);
>
> bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
>
> +static inline bool resctrl_is_mbm_event(enum resctrl_event_id e)
> +{
> + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> + e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> +}
> +
> +#define for_each_mbm_event(evt) \
> + for (evt = QOS_L3_MBM_TOTAL_EVENT_ID; evt <= QOS_L3_MBM_LOCAL_EVENT_ID; evt++)
> +
> /**
> * resctrl_arch_mon_event_config_write() - Write the config for an event.
> * @config_info: struct resctrl_mon_config_info describing the resource, domain
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index a25fb9c4070d..b468bfbab9ea 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -47,4 +47,7 @@ enum resctrl_event_id {
> QOS_NUM_EVENTS,
> };
>
> +#define QOS_NUM_L3_MBM_EVENTS (QOS_L3_MBM_LOCAL_EVENT_ID - QOS_L3_MBM_TOTAL_EVENT_ID + 1)
> +#define MBM_STATE_IDX(evt) ((evt) - QOS_L3_MBM_TOTAL_EVENT_ID)
> +
> #endif /* __LINUX_RESCTRL_TYPES_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 5e3c41b36437..ea185b4d0d59 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -54,15 +54,13 @@ struct rdt_hw_ctrl_domain {
> * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> * a resource for a monitor function
> * @d_resctrl: Properties exposed to the resctrl file system
> - * @arch_mbm_total: arch private state for MBM total bandwidth
> - * @arch_mbm_local: arch private state for MBM local bandwidth
> + * @arch_mbm_states: arch private state for each MBM event
> *
> * Members of this structure are accessed via helpers that provide abstraction.
> */
> struct rdt_hw_mon_domain {
> struct rdt_mon_domain d_resctrl;
> - struct arch_mbm_state *arch_mbm_total;
> - struct arch_mbm_state *arch_mbm_local;
> + struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
> };
>
> static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 819bc7a09327..4403a820db12 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -364,8 +364,8 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
>
> static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
> {
> - kfree(hw_dom->arch_mbm_total);
> - kfree(hw_dom->arch_mbm_local);
> + for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
> + kfree(hw_dom->arch_mbm_states[i]);
Is it better to define a helper for_each_mbm_event_idx(i)?
#define for_each_mbm_event_idx(i) \
	for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
Then the above for loop can be simplified to:
	for_each_mbm_event_idx(i)
		kfree(hw_dom->arch_mbm_states[i]);
The helper can be used in other places as well (see below).
> kfree(hw_dom);
> }
>
> @@ -399,25 +399,27 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
> */
> static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
> {
> - size_t tsize;
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> - tsize = sizeof(*hw_dom->arch_mbm_total);
> - hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL);
> - if (!hw_dom->arch_mbm_total)
> - return -ENOMEM;
> - }
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
> - tsize = sizeof(*hw_dom->arch_mbm_local);
> - hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL);
> - if (!hw_dom->arch_mbm_local) {
> - kfree(hw_dom->arch_mbm_total);
> - hw_dom->arch_mbm_total = NULL;
> - return -ENOMEM;
> - }
> + size_t tsize = sizeof(struct arch_mbm_state);
> + enum resctrl_event_id evt;
> + int idx;
> +
> + for_each_mbm_event(evt) {
> + if (!resctrl_is_mon_event_enabled(evt))
> + continue;
> + idx = MBM_STATE_IDX(evt);
> + hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
> + if (!hw_dom->arch_mbm_states[idx])
> + goto cleanup;
> }
>
> return 0;
> +cleanup:
> + while (--idx >= 0) {
> + kfree(hw_dom->arch_mbm_states[idx]);
> + hw_dom->arch_mbm_states[idx] = NULL;
> + }
> +
> + return -ENOMEM;
> }
>
> static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index fda579251dba..85526e5540f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,18 +160,14 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
> u32 rmid,
> enum resctrl_event_id eventid)
> {
> - switch (eventid) {
> - case QOS_L3_OCCUP_EVENT_ID:
> - return NULL;
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &hw_dom->arch_mbm_total[rmid];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &hw_dom->arch_mbm_local[rmid];
> - default:
> - /* Never expect to get here */
> - WARN_ON_ONCE(1);
> + struct arch_mbm_state *state;
> +
> + if (!resctrl_is_mbm_event(eventid))
> return NULL;
> - }
> +
> + state = hw_dom->arch_mbm_states[MBM_STATE_IDX(eventid)];
> +
> + return state ? &state[rmid] : NULL;
> }
>
> void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> @@ -200,14 +196,16 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
> void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_total, 0,
> - sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
> -
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
> - memset(hw_dom->arch_mbm_local, 0,
> - sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
> + enum resctrl_event_id evt;
> + int idx;
> +
> + for_each_mbm_event(evt) {
> + idx = MBM_STATE_IDX(evt);
> + if (!hw_dom->arch_mbm_states[idx])
> + continue;
> + memset(hw_dom->arch_mbm_states[idx], 0,
> + sizeof(struct arch_mbm_state) * r->num_rmid);
> + }
> }
>
> static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 325e23c1a403..4cd0789998bf 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -346,15 +346,14 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
> u32 rmid, enum resctrl_event_id evtid)
> {
> u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
> + struct mbm_state *states;
>
> - switch (evtid) {
> - case QOS_L3_MBM_TOTAL_EVENT_ID:
> - return &d->mbm_total[idx];
> - case QOS_L3_MBM_LOCAL_EVENT_ID:
> - return &d->mbm_local[idx];
> - default:
> + if (!resctrl_is_mbm_event(evtid))
> return NULL;
> - }
> +
> + states = d->mbm_states[MBM_STATE_IDX(evtid)];
> +
> + return states ? &states[idx] : NULL;
> }
>
> static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 80e74940281a..8649b89d7bfd 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -127,12 +127,6 @@ static bool resctrl_is_mbm_enabled(void)
> resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID));
> }
>
> -static bool resctrl_is_mbm_event(int e)
> -{
> - return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
> - e <= QOS_L3_MBM_LOCAL_EVENT_ID);
> -}
> -
> /*
> * Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap
> * of free CLOSIDs.
> @@ -4020,8 +4014,10 @@ static void rdtgroup_setup_default(void)
> static void domain_destroy_mon_state(struct rdt_mon_domain *d)
> {
> bitmap_free(d->rmid_busy_llc);
> - kfree(d->mbm_total);
> - kfree(d->mbm_local);
> + for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++) {
The for loop can be simplified by the new helper:
for_each_mbm_event_idx(i) {
> + kfree(d->mbm_states[i]);
> + d->mbm_states[i] = NULL;
> + }
> }
>
> void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
> @@ -4081,32 +4077,34 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
> {
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> - size_t tsize;
> + size_t tsize = sizeof(struct mbm_state);
> + enum resctrl_event_id evt;
> + int idx;
>
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) {
> d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
> if (!d->rmid_busy_llc)
> return -ENOMEM;
> }
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID)) {
> - tsize = sizeof(*d->mbm_total);
> - d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
> - if (!d->mbm_total) {
> - bitmap_free(d->rmid_busy_llc);
> - return -ENOMEM;
> - }
> - }
> - if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID)) {
> - tsize = sizeof(*d->mbm_local);
> - d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
> - if (!d->mbm_local) {
> - bitmap_free(d->rmid_busy_llc);
> - kfree(d->mbm_total);
> - return -ENOMEM;
> - }
> +
> + for_each_mbm_event(evt) {
> + if (!resctrl_is_mon_event_enabled(evt))
> + continue;
> + idx = MBM_STATE_IDX(evt);
> + d->mbm_states[idx] = kcalloc(idx_limit, tsize, GFP_KERNEL);
> + if (!d->mbm_states[idx])
> + goto cleanup;
> }
>
> return 0;
> +cleanup:
> + bitmap_free(d->rmid_busy_llc);
> + while (--idx >= 0) {
> + kfree(d->mbm_states[idx]);
> + d->mbm_states[idx] = NULL;
> + }
> +
> + return -ENOMEM;
> }
>
> int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
for_each_mbm_event() can simplify mbm_update() as well?
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- mbm_update_one_event(r, d, closid, rmid,
QOS_L3_MBM_TOTAL_EVENT_ID);
-
- if (resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- mbm_update_one_event(r, d, closid, rmid,
QOS_L3_MBM_LOCAL_EVENT_ID);
+ for_each_mbm_event(evt) {
+ if (resctrl_is_mon_event_enabled(evt))
+ mbm_update_one_event(r, d, closid, rmid, evt);
+ }
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr
2025-05-21 22:50 ` [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr Tony Luck
2025-05-22 0:01 ` Keshavamurthy, Anil S
2025-06-04 3:37 ` Reinette Chatre
@ 2025-06-07 0:52 ` Fenghua Yu
2025-06-08 22:02 ` Luck, Tony
2 siblings, 1 reply; 90+ messages in thread
From: Fenghua Yu @ 2025-06-07 0:52 UTC (permalink / raw)
To: Tony Luck, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi, Tony,
On 5/21/25 15:50, Tony Luck wrote:
> Historically all monitoring events have been associated with the L3
> resource and it made sense to use "struct rdt_mon_domain *" arguments
> to functions manipulating domains. But the addition of monitor events
> tied to other resources changes this assumption.
>
> Some functionality like:
> *) adding a CPU to an existing domain
> *) removing a CPU that is not the last one from a domain
> can be achieved with just access to the rdt_domain_hdr structure.
>
> Change arguments from "rdt_*_domain" to rdt_domain_hdr so functions
> can be used on domains from any resource.
>
> Add sanity checks where container_of() is used to find the surrounding
> domain structure that hdr has the expected type.
>
> Simplify code that uses "d->hdr." to "hdr->" where possible.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 4 +-
> arch/x86/kernel/cpu/resctrl/core.c | 39 +++++++-------
> fs/resctrl/rdtgroup.c | 83 +++++++++++++++++++++---------
> 3 files changed, 79 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index d6b09952ef92..c02a4d59f3eb 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -444,9 +444,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type type);
> int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
> void resctrl_online_cpu(unsigned int cpu);
> void resctrl_offline_cpu(unsigned int cpu);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index e4125161ffbd..71b884f25475 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -458,9 +458,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (hdr) {
> if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> return;
> - d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> -
> - cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + cpumask_set_cpu(cpu, &hdr->cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> return;
> @@ -524,7 +522,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>
> list_add_tail_rcu(&d->hdr.list, add_pos);
>
> - err = resctrl_online_mon_domain(r, d);
> + err = resctrl_online_mon_domain(r, &d->hdr);
> if (err) {
> list_del_rcu(&d->hdr.list);
> synchronize_rcu();
> @@ -597,25 +595,24 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
> return;
>
> + cpumask_clear_cpu(cpu, &hdr->cpu_mask);
> + if (!cpumask_empty(&hdr->cpu_mask))
> + return;
> +
> d = container_of(hdr, struct rdt_ctrl_domain, hdr);
> hw_dom = resctrl_to_arch_ctrl_dom(d);
>
> - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> - if (cpumask_empty(&d->hdr.cpu_mask)) {
> - resctrl_offline_ctrl_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> - synchronize_rcu();
> -
> - /*
> - * rdt_ctrl_domain "d" is going to be freed below, so clear
> - * its pointer from pseudo_lock_region struct.
> - */
> - if (d->plr)
> - d->plr->d = NULL;
> - ctrl_domain_free(hw_dom);
> + resctrl_offline_ctrl_domain(r, d);
> + list_del_rcu(&hdr->list);
> + synchronize_rcu();
>
> - return;
> - }
> + /*
> + * rdt_ctrl_domain "d" is going to be freed below, so clear
> + * its pointer from pseudo_lock_region struct.
> + */
> + if (d->plr)
> + d->plr->d = NULL;
> + ctrl_domain_free(hw_dom);
> }
>
> static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> @@ -651,8 +648,8 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> case RDT_RESOURCE_L3:
> d = container_of(hdr, struct rdt_mon_domain, hdr);
> hw_dom = resctrl_to_arch_mon_dom(d);
> - resctrl_offline_mon_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> + resctrl_offline_mon_domain(r, hdr);
> + list_del_rcu(&hdr->list);
> synchronize_rcu();
> l3_mon_domain_free(hw_dom);
> break;
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 828c743ec470..0213fb3a1113 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3022,7 +3022,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> * when last domain being summed is removed.
> */
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct rdtgroup *prgrp, *crgrp;
> char subname[32];
> @@ -3030,9 +3030,17 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> char name[32];
>
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> - if (snc_mode)
> - sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + if (snc_mode) {
> + struct rdt_mon_domain *d;
> +
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> + sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
> + } else {
> + sprintf(name, "mon_%s_%02d", r->name, hdr->id);
> + }
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> @@ -3042,11 +3050,12 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> }
> }
>
> -static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp,
> bool do_sum)
> {
> struct rmid_read rr = {0};
> + struct rdt_mon_domain *d;
> struct mon_data *priv;
> struct mon_evt *mevt;
> int ret, domid;
> @@ -3054,7 +3063,14 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> for (mevt = &mon_event_all[0]; mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++) {
> if (mevt->rid != r->rid || !mevt->enabled)
> continue;
> - domid = do_sum ? d->ci->id : d->hdr.id;
> + if (r->rid == RDT_RESOURCE_L3) {
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return -EINVAL;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + domid = do_sum ? d->ci->id : d->hdr.id;
> + } else {
> + domid = hdr->id;
> + }
> priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
> if (WARN_ON_ONCE(!priv))
> return -EINVAL;
> @@ -3063,18 +3079,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> if (ret)
> return ret;
>
> - if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
> - mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
> + if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
> + mon_event_read(&rr, r, d, prgrp, &hdr->cpu_mask, mevt->evtid, true);
> }
>
> return 0;
> }
>
> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> - struct rdt_mon_domain *d,
> + struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> struct kernfs_node *kn, *ckn;
> + struct rdt_mon_domain *d;
> char name[32];
> bool snc_mode;
> int ret = 0;
> @@ -3082,7 +3099,14 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> lockdep_assert_held(&rdtgroup_mutex);
>
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
> + if (snc_mode) {
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return -EINVAL;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> + } else {
> + sprintf(name, "mon_%s_%02d", r->name, hdr->id);
> + }
> kn = kernfs_find_and_get(parent_kn, name);
> if (kn) {
> /*
> @@ -3098,13 +3122,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> ret = rdtgroup_kn_set_ugid(kn);
> if (ret)
> goto out_destroy;
> - ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
> + ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
> if (ret)
> goto out_destroy;
> }
>
> if (snc_mode) {
> - sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
> ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> if (IS_ERR(ckn)) {
> ret = -EINVAL;
> @@ -3115,7 +3139,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> if (ret)
> goto out_destroy;
>
> - ret = mon_add_all_files(ckn, d, r, prgrp, false);
> + ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
> if (ret)
> goto out_destroy;
> }
> @@ -3133,7 +3157,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> * and "monitor" groups with given domain id.
> */
> static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct kernfs_node *parent_kn;
> struct rdtgroup *prgrp, *crgrp;
> @@ -3141,12 +3165,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> parent_kn = prgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
>
> head = &prgrp->mon.crdtgrp_list;
> list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
> parent_kn = crgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, crgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
> }
> }
> }
> @@ -3155,14 +3179,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_resource *r,
> struct rdtgroup *prgrp)
> {
> - struct rdt_mon_domain *dom;
> + struct rdt_domain_hdr *hdr;
> int ret;
>
> /* Walking r->domains, ensure it can't race with cpuhp */
> lockdep_assert_cpus_held();
>
> - list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> - ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> + list_for_each_entry(hdr, &r->mon_domains, list) {
> + ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
> if (ret)
> return ret;
> }
> @@ -4030,8 +4054,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> + struct rdt_mon_domain *d;
> +
> mutex_lock(&rdtgroup_mutex);
>
> /*
> @@ -4039,11 +4065,15 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> * per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - rmdir_mondata_subdir_allrdtgrp(r, d);
> + rmdir_mondata_subdir_allrdtgrp(r, hdr);
>
> if (r->rid != RDT_RESOURCE_L3)
> goto done;
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + return;
rdtgroup_mutex is being locked right now. Cannot return without
unlocking it.
s/return;/goto done;/
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> if (resctrl_is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
> @@ -4126,12 +4156,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
> return err;
> }
>
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> - int err;
> + struct rdt_mon_domain *d;
> + int err = -EINVAL;
>
> mutex_lock(&rdtgroup_mutex);
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + goto out_unlock;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> err = domain_setup_l3_mon_state(r, d);
> if (err)
> goto out_unlock;
> @@ -4152,7 +4187,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> * If resctrl is mounted, add per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - mkdir_mondata_subdir_allrdtgrp(r, d);
> + mkdir_mondata_subdir_allrdtgrp(r, hdr);
>
> out_unlock:
> mutex_unlock(&rdtgroup_mutex);
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-05-21 22:50 ` [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
2025-06-04 4:06 ` Reinette Chatre
@ 2025-06-07 0:54 ` Fenghua Yu
2025-06-08 22:03 ` Luck, Tony
1 sibling, 1 reply; 90+ messages in thread
From: Fenghua Yu @ 2025-06-07 0:54 UTC (permalink / raw)
To: Tony Luck, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi, Tony,
On 5/21/25 15:50, Tony Luck wrote:
> The L3 resource has several requirements for domains. There are structures
> that hold the 64-bit values of counters, and elements to keep track of
> the overflow and limbo threads.
>
> None of these are needed for the PERF_PKG resource. The hardware counters
> are wide enough that they do not wrap around for decades.
>
> Define a new rdt_perf_pkg_mon_domain structure which just consists of
> the standard rdt_domain_hdr to keep track of domain id and CPU mask.
>
> Change domain_add_cpu_mon(), domain_remove_cpu_mon(),
> resctrl_offline_mon_domain(), and resctrl_online_mon_domain() to check
> resource type and perform only the operations needed for domains in the
> PERF_PKG resource.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 41 ++++++++++++++++++++++++++++++
> fs/resctrl/rdtgroup.c | 4 +++
> 2 files changed, 45 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 64ce561e77a0..18d84c497ee4 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -540,6 +540,38 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
> }
> }
>
> +/**
> + * struct rdt_perf_pkg_mon_domain - CPUs sharing an Intel-PMT-scoped resctrl monitor resource
> + * @hdr: common header for different domain types
> + */
> +struct rdt_perf_pkg_mon_domain {
> + struct rdt_domain_hdr hdr;
> +};
> +
> +static void setup_intel_aet_mon_domain(int cpu, int id, struct rdt_resource *r,
> + struct list_head *add_pos)
> +{
> + struct rdt_perf_pkg_mon_domain *d;
> + int err;
> +
> + d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
> + if (!d)
> + return;
> +
> + d->hdr.id = id;
> + d->hdr.type = RESCTRL_MON_DOMAIN;
> + d->hdr.rid = r->rid;
> + cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
> + list_add_tail_rcu(&d->hdr.list, add_pos);
> +
> + err = resctrl_online_mon_domain(r, &d->hdr);
> + if (err) {
> + list_del_rcu(&d->hdr.list);
> + synchronize_rcu();
> + kfree(d);
> + }
> +}
> +
> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> @@ -567,6 +599,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> case RDT_RESOURCE_L3:
> l3_mon_domain_setup(cpu, id, r, add_pos);
> break;
> + case RDT_RESOURCE_PERF_PKG:
> + setup_intel_aet_mon_domain(cpu, id, r, add_pos);
> + break;
> default:
> WARN_ON_ONCE(1);
> }
> @@ -666,6 +701,12 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> default:
> pr_warn_once("Unknown resource rid=%d\n", r->rid);
> break;
> + case RDT_RESOURCE_PERF_PKG:
> + resctrl_offline_mon_domain(r, hdr);
> + list_del_rcu(&hdr->list);
> + synchronize_rcu();
> + kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr));
> + break;
> }
Why default is not the last one?
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events
2025-05-21 22:50 ` [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
2025-06-04 3:57 ` Reinette Chatre
@ 2025-06-07 0:57 ` Fenghua Yu
2025-06-08 22:05 ` Luck, Tony
1 sibling, 1 reply; 90+ messages in thread
From: Fenghua Yu @ 2025-06-07 0:57 UTC (permalink / raw)
To: Tony Luck, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi, Tony,
On 5/21/25 15:50, Tony Luck wrote:
> Clearwater Forest supports two energy related telemetry events
> and seven perf style events. The counters are arranged in per-RMID
> blocks like this:
>
> MMIO offset:0x00 Counter for RMID 0 Event 0
> MMIO offset:0x08 Counter for RMID 0 Event 1
> MMIO offset:0x10 Counter for RMID 0 Event 2
> MMIO offset:0x18 Counter for RMID 1 Event 0
> MMIO offset:0x20 Counter for RMID 1 Event 1
> MMIO offset:0x28 Counter for RMID 1 Event 2
> ...
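For illustration, the byte offset of a counter implied by this layout (per-RMID
blocks of "num_events" 8-byte counters) could be computed as below; the helper
name is invented, not taken from the posted patch:

	static size_t pmt_counter_offset(int rmid, int evt_idx, int num_events)
	{
		/* e.g. RMID 1, event 0, 3 events per block: (1 * 3 + 0) * 8 = 0x18 */
		return ((size_t)rmid * num_events + evt_idx) * 8;
	}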
>
> Define these events in the file system code and add the events
> to the event_group structures.
>
> PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point
> format. File system code must output as floating point values.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl_types.h | 11 ++++++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 33 ++++++++++++++++++
> fs/resctrl/monitor.c | 45 +++++++++++++++++++++++++
> 3 files changed, 89 insertions(+)
>
> diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
> index b468bfbab9ea..455b29a0a9b9 100644
> --- a/include/linux/resctrl_types.h
> +++ b/include/linux/resctrl_types.h
> @@ -43,6 +43,17 @@ enum resctrl_event_id {
> QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
> QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
>
> + /* Intel Telemetry Events */
> + PMT_EVENT_ENERGY,
> + PMT_EVENT_ACTIVITY,
> + PMT_EVENT_STALLS_LLC_HIT,
> + PMT_EVENT_C1_RES,
> + PMT_EVENT_UNHALTED_CORE_CYCLES,
> + PMT_EVENT_STALLS_LLC_MISS,
> + PMT_EVENT_AUTO_C6_RES,
> + PMT_EVENT_UNHALTED_REF_CYCLES,
> + PMT_EVENT_UOPS_RETIRED,
> +
> /* Must be the last */
> QOS_NUM_EVENTS,
> };
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 2316198eb69e..bf8e2a6126d2 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -34,6 +34,20 @@ struct mmio_info {
> void __iomem *addrs[] __counted_by(count);
> };
>
> +/**
> + * struct pmt_event - Telemetry event.
> + * @evtid: Resctrl event id
> + * @evt_idx: Counter index within each per-RMID block of counters
> + * @bin_bits: Zero for integer valued events, else number bits in fixed-point
> + */
> +struct pmt_event {
> + enum resctrl_event_id evtid;
> + int evt_idx;
> + int bin_bits;
> +};
> +
> +#define EVT(id, idx, bits) { .evtid = id, .evt_idx = idx, .bin_bits = bits }
> +
> /**
> * struct event_group - All information about a group of telemetry events.
> * @pfg: Points to the aggregated telemetry space information
> @@ -42,6 +56,8 @@ struct mmio_info {
> * @pkginfo: Per-package MMIO addresses of telemetry regions belonging to this group
> * @guid: Unique number per XML description file.
> * @mmio_size: Number of bytes of MMIO registers for this group.
> + * @num_events: Number of events in this group.
> + * @evts: Array of event descriptors.
> */
> struct event_group {
> /* Data fields used by this code. */
> @@ -51,6 +67,8 @@ struct event_group {
> /* Remaining fields initialized from XML file. */
> u32 guid;
> size_t mmio_size;
> + int num_events;
> + struct pmt_event evts[] __counted_by(num_events);
> };
>
> /*
> @@ -60,6 +78,11 @@ struct event_group {
> static struct event_group energy_0x26696143 = {
> .guid = 0x26696143,
> .mmio_size = (576 * 2 + 3) * 8,
> + .num_events = 2,
> + .evts = {
Please align the "=" with the above "=".
> + EVT(PMT_EVENT_ENERGY, 0, 18),
> + EVT(PMT_EVENT_ACTIVITY, 1, 18),
> + }
> };
>
> /*
> @@ -69,6 +92,16 @@ static struct event_group energy_0x26696143 = {
> static struct event_group perf_0x26557651 = {
> .guid = 0x26557651,
> .mmio_size = (576 * 7 + 3) * 8,
> + .num_events = 7,
> + .evts = {
Ditto.
[SNIP]
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option
2025-06-06 23:55 ` Fenghua Yu
@ 2025-06-08 21:52 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-06-08 21:52 UTC (permalink / raw)
To: Fenghua Yu
Cc: Reinette Chatre, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 04:55:37PM -0700, Fenghua Yu wrote:
> Hi, Tony,
>
> On 5/21/25 15:50, Tony Luck wrote:
> > Users may want to force either of the telemetry features on
> > (in the case where they are disabled due to erratum) or off
> > (in the case that a limited number of RMIDs for a telemetry
> > feature reduces the number of monitor groups that can be
> > created.)
[SNIP]
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index f07f5b58639a..b23309566500 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -797,6 +797,8 @@ enum {
> > RDT_FLAG_MBA,
> > RDT_FLAG_SMBA,
> > RDT_FLAG_BMEC,
> > + RDT_FLAG_ENERGY,
> > + RDT_FLAG_PERF,
> > };
> > #define RDT_OPT(idx, n, f) \
> > @@ -822,6 +824,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
> > RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
> > RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
> > RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
> > + RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
> > + RDT_OPT(RDT_FLAG_PERF, "perf", 0),
>
> Boot options "energy" and "perf" are at PMT event group level. Other boot
> options are at individual event level.
>
> E.g. "!perf" forces off all 7 PMT PERF events.
>
> E.g. the "uops retired" event has an erratum but all other PERF events work
> fine. Disabling the "perf" group disables all PERF events. Is "!perf" a useful
> boot option?
>
> Is there any consideration to have boot options at individual PMT event level,
> like the legacy events, instead of at PMT event group level?
This could be done, but it would add some complexity that may never
be needed. I'm optimistic that all events in a group will work.
>
> [SNIP]
>
> Thanks.
>
> -Fenghua
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events
2025-06-07 0:45 ` Fenghua Yu
@ 2025-06-08 21:59 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-06-08 21:59 UTC (permalink / raw)
To: Fenghua Yu
Cc: Reinette Chatre, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 05:45:58PM -0700, Fenghua Yu wrote:
[SNIP]
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 819bc7a09327..4403a820db12 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -364,8 +364,8 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
> > static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
> > {
> > - kfree(hw_dom->arch_mbm_total);
> > - kfree(hw_dom->arch_mbm_local);
> > + for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
> > + kfree(hw_dom->arch_mbm_states[i]);
>
> Is it better to define a helper for_each_mon_event_idx(i)?
>
> #define for_each_mbm_event_idx(i) \
>
> for (int i = 0; i < QOS_NUM_L3_MBM_EVENTS; i++)
>
> Then the above for loop can be simplified to:
>
> for_each_mbm_event_idxd(i)
>
> kfree(hw_dom->arch_mbm_states[i]);
>
> The helper can be used in other places as well (see below).
I think there are only two places total. So maybe?
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr
2025-06-07 0:52 ` Fenghua Yu
@ 2025-06-08 22:02 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-06-08 22:02 UTC (permalink / raw)
To: Fenghua Yu
Cc: Reinette Chatre, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 05:52:16PM -0700, Fenghua Yu wrote:
> > -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> > +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> > {
> > + struct rdt_mon_domain *d;
> > +
> > mutex_lock(&rdtgroup_mutex);
> > /*
> > @@ -4039,11 +4065,15 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> > * per domain monitor data directories.
> > */
> > if (resctrl_mounted && resctrl_arch_mon_capable())
> > - rmdir_mondata_subdir_allrdtgrp(r, d);
> > + rmdir_mondata_subdir_allrdtgrp(r, hdr);
> > if (r->rid != RDT_RESOURCE_L3)
> > goto done;
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> > + return;
>
> rdtgroup_mutex is being locked right now. Cannot return without unlocking
> it.
>
> s/return;/goto done;/
Yup. Though "goto out_unlock" to meet resctrl style of more meaningful
goto label names.
Thanks
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-06-07 0:54 ` Fenghua Yu
@ 2025-06-08 22:03 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-06-08 22:03 UTC (permalink / raw)
To: Fenghua Yu
Cc: Reinette Chatre, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 05:54:29PM -0700, Fenghua Yu wrote:
> > @@ -666,6 +701,12 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> > default:
> > pr_warn_once("Unknown resource rid=%d\n", r->rid);
> > break;
> > + case RDT_RESOURCE_PERF_PKG:
> > + resctrl_offline_mon_domain(r, hdr);
> > + list_del_rcu(&hdr->list);
> > + synchronize_rcu();
> > + kfree(container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr));
> > + break;
> > }
>
> Why default is not the last one?
Fixed.
Thanks
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events
2025-06-07 0:57 ` Fenghua Yu
@ 2025-06-08 22:05 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-06-08 22:05 UTC (permalink / raw)
To: Fenghua Yu
Cc: Reinette Chatre, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 05:57:09PM -0700, Fenghua Yu wrote:
> > @@ -60,6 +78,11 @@ struct event_group {
> > static struct event_group energy_0x26696143 = {
> > .guid = 0x26696143,
> > .mmio_size = (576 * 2 + 3) * 8,
> > + .num_events = 2,
> > + .evts = {
> Please align the "=" with the above "=".
Fixed.
> > + EVT(PMT_EVENT_ENERGY, 0, 18),
> > + EVT(PMT_EVENT_ACTIVITY, 1, 18),
> > + }
> > };
> > /*
> > @@ -69,6 +92,16 @@ static struct event_group energy_0x26696143 = {
> > static struct event_group perf_0x26557651 = {
> > .guid = 0x26557651,
> > .mmio_size = (576 * 7 + 3) * 8,
> > + .num_events = 7,
> > + .evts = {
>
> Ditto.
Ditto fixed.
Thanks
-Tony
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-06 21:14 ` Reinette Chatre
@ 2025-06-09 18:49 ` Luck, Tony
2025-06-09 22:39 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-09 18:49 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 02:14:56PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 6/6/25 10:30 AM, Luck, Tony wrote:
> > On Fri, Jun 06, 2025 at 09:26:06AM -0700, Reinette Chatre wrote:
> >> With /sys/kernel/debug/resctrl potentially mirroring /sys/fs/resctrl to
> >> support various debugging scenarios there may later be resource level
> >> debugging for which a "/sys/kernel/debug/resctrl/info/<resource>/<debugfile>" can
> >> be used. Considering this it looks to me as though one possible boundary could
> >> be to isolate arch specific debug to, for example, a new directory named
> >> "/sys/kernel/debug/resctrl/info/arch_debug_name_tbd/". By placing the
> >> arch debug in a sub-directory named "info" it avoids collision with resource
> >> group names with naming that also avoids collision with resource names since
> >> all these names are controlled by resctrl fs.
> >
> >
> > That seems like a good path. PoC patch below. Note that I put the dentry
> > for the debug info directory into struct rdt_resource. So no call from
> > architecture to file system code needed to access.
>
> ok, reading between the lines there is now a switch to per-resource
> requirement, which fits with the use.
>
> >
> > Directory layout looks like this:
> >
> > # tree /sys/kernel/debug/resctrl/
> > /sys/kernel/debug/resctrl/
> > └── info
> > ├── L2
> > ├── L3
> > ├── MB
> > └── SMBA
> >
>
> This looks like something that needs to be owned and managed by
> resctrl fs (more below).
>
> > 6 directories, 0 files
> >
> > -Tony
> >
> > ---
> >
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 5e28e81b35f6..78dd0f8f7ad8 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -281,6 +281,7 @@ enum resctrl_schema_fmt {
> > * @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth
> > * monitoring events can be configured.
> > * @cdp_capable: Is the CDP feature available on this resource
> > + * @arch_debug_info: Debugfs info directory for architecture use
> > */
> > struct rdt_resource {
> > int rid;
> > @@ -297,6 +298,7 @@ struct rdt_resource {
> > enum resctrl_schema_fmt schema_fmt;
> > unsigned int mbm_cfg_mask;
> > bool cdp_capable;
> > + struct dentry *arch_debug_info;
> > };
>
> ok ... but maybe not quite exactly (more below)
Would have been useful with the "always create directories" approach.
As you point out below the name is problematic. Would need separate
entries for control and monitor resources like RDT_RESOURCE_L3.
I don't think it is useful in the "only make directories when requested
by architecture" mode.
> >
> > /*
> > diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> > index ed4fc45da346..48c587201fb6 100644
> > --- a/fs/resctrl/rdtgroup.c
> > +++ b/fs/resctrl/rdtgroup.c
> > @@ -4274,6 +4274,8 @@ void resctrl_offline_cpu(unsigned int cpu)
> > */
> > int resctrl_init(void)
> > {
> > + struct dentry *debuginfodir;
> > + struct rdt_resource *r;
> > int ret = 0;
> >
> > seq_buf_init(&last_cmd_status, last_cmd_status_buf,
> > @@ -4320,6 +4322,12 @@ int resctrl_init(void)
> > */
> > debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
> >
> > + /* Create debug info directories for each resource */
> > + debuginfodir = debugfs_create_dir("info", debugfs_resctrl);
> > +
> > + for_each_rdt_resource(r)
> > + r->arch_debug_info = debugfs_create_dir(r->name, debuginfodir);
>
> This ignores (*) several of the boundaries my response aimed to establish.
>
> Here are some red flags:
> - This creates the resource named directory and hands off that pointer to the
> arch. As I mentioned the arch should not have control over resctrl's debugfs.
> I believe this is the type of information that should be in control of resctrl fs
> since, as I mentioned, resctrl fs may need to add debugging that mirrors /sys/fs/resctrl.
> - Blindly creating these directories (a) without the resource even existing on the
> system, and (b) without being used/requested by the architecture does not create a good
> interface in my opinion. User space will see a bunch of empty directories
> associated with resources that are not present on the system.
> - The directories created do not even match /sys/fs/resctrl/info when it comes
> to the resources. Note that the directories within /sys/fs/resctrl/info are created
> from the schema for control resources and appends _MON to monitor resources. Like
> I mentioned in my earlier response there should ideally be space for a future
> resctrl fs extension to mirror layout of /sys/fs/resctrl for resctrl fs debug
> in debugfs. This solution ignores all of that.
>
> I still think that the architecture should request the debugfs directory from resctrl fs.
> This avoids resctrl fs needing to create directories/files that are never used and
> does not present user space with an empty tree. Considering that the new PERF_PKG
> resource may not come online until resctrl mount this should be something that can be
> called at any time.
>
> One possibility, that supports intended use while keeping the door open to support
> future resctrl fs use of the debugfs, could be a new resctrl fs function,
> for example resctrl_create_mon_resource_debugfs(struct rdt_resource *r), that will initialize
> rdt_resource::arch_debug_info(*) to point to the dentry of newly created
> /sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON/arch_debug_name_TBD *if*
> the associated resource is capable of monitoring ... or do you think an architecture
> may want to add debugging information before a resource is discovered/enabled?
> If doing this then rdt_resource::arch_debug_info is no longer appropriate since it needs
> to be specific to the monitoring resource. Perhaps then rdt_resource::arch_mon_debugfs
> that would eventually live in [1]?
>
> This is feeling rushed and I am sharing some top of mind ideas. I will give this
> more thought.
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/cb8425c73f57280b0b4f22e089b2912eede42f7a.1747349530.git.babu.moger@amd.com/
>
> (*) I have now asked several times to stop ignoring feedback. This should not even
> be necessary in the first place. I do not require you to agree with me and I do not claim
> to always be right; please just stop ignoring feedback. Going forward I plan to ignore
> messages that ignore feedback.
So here's a second PoC. Takes into account all of the points you make
above with the following adjustments:
1) Not adding the rdt_resource::arch_mon_debugfs field. Just returning
the "struct dentry *" looks to be adequate for existing use case.
Having the pointer in "struct resource" would be useful if some future
use case needed to access the debugfs locations from calls to
architecture code that pass in the rdt_resource pointer. Could be
added if ever needed.
2) I can't envision a need for debugfs entries for resources
pre-discovery, or when not enabled. So keep things simple for
now.
3) I think the function name resctrl_debugfs_mon_info_mkdir() is a bit
more descriptive (it is making a directory and we usually have such
functions include "mkdir" in the name).
-Tony
---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8bec8f766b01..771e69c0c5c1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -564,6 +564,12 @@ void resctrl_arch_reset_all_ctrls(struct rdt_resource *r);
extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
+/**
+ * resctrl_debugfs_mon_info_mkdir() - Create a debugfs info directory.
+ * @r: Resource (must be mon_capable).
+ */
+struct dentry *resctrl_debugfs_mon_info_mkdir(struct rdt_resource *r);
+
int resctrl_init(void);
void resctrl_exit(void);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 8d094a3acf2f..0f11b8d0ce0b 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4344,6 +4344,22 @@ int resctrl_init(void)
return ret;
}
+struct dentry *resctrl_debugfs_mon_info_mkdir(struct rdt_resource *r)
+{
+ static struct dentry *debugfs_resctrl_info;
+ char name[32];
+
+ if (!r->mon_capable)
+ return NULL;
+
+ if (!debugfs_resctrl_info)
+ debugfs_resctrl_info = debugfs_create_dir("info", debugfs_resctrl);
+
+ sprintf(name, "%s_MON", r->name);
+
+ return debugfs_create_dir(name, debugfs_resctrl_info);
+}
+
static bool resctrl_online_domains_exist(void)
{
struct rdt_resource *r;
--
2.49.0
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-09 18:49 ` Luck, Tony
@ 2025-06-09 22:39 ` Reinette Chatre
2025-06-09 23:34 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-09 22:39 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
Tony,
On 6/9/25 11:49 AM, Luck, Tony wrote:
> On Fri, Jun 06, 2025 at 02:14:56PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 6/6/25 10:30 AM, Luck, Tony wrote:
>>> On Fri, Jun 06, 2025 at 09:26:06AM -0700, Reinette Chatre wrote:
>>>> With /sys/kernel/debug/resctrl potentially mirroring /sys/fs/resctrl to
>>>> support various debugging scenarios there may later be resource level
>>>> debugging for which a "/sys/kernel/debug/resctrl/info/<resource>/<debugfile>" can
>>>> be used. Considering this it looks to me as though one possible boundary could
>>>> be to isolate arch specific debug to, for example, a new directory named
>>>> "/sys/kernel/debug/resctrl/info/arch_debug_name_tbd/". By placing the
>>>> arch debug in a sub-directory named "info" it avoids collision with resource
>>>> group names with naming that also avoids collision with resource names since
>>>> all these names are controlled by resctrl fs.
>>>
>>>
>>> That seems like a good path. PoC patch below. Note that I put the dentry
>>> for the debug info directory into struct rdt_resource. So no call from
>>> architecture to file system code needed to access.
>>
>> ok, reading between the lines there is now a switch to per-resource
>> requirement, which fits with the use.
>>
>>>
>>> Directory layout looks like this:
>>>
>>> # tree /sys/kernel/debug/resctrl/
>>> /sys/kernel/debug/resctrl/
>>> └── info
>>> ├── L2
>>> ├── L3
>>> ├── MB
>>> └── SMBA
>>>
>>
>> This looks like something that needs to be owned and managed by
>> resctrl fs (more below).
>>
>>> 6 directories, 0 files
>>>
>>> -Tony
>>>
>>> ---
>>>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 5e28e81b35f6..78dd0f8f7ad8 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -281,6 +281,7 @@ enum resctrl_schema_fmt {
>>> * @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth
>>> * monitoring events can be configured.
>>> * @cdp_capable: Is the CDP feature available on this resource
>>> + * @arch_debug_info: Debugfs info directory for architecture use
>>> */
>>> struct rdt_resource {
>>> int rid;
>>> @@ -297,6 +298,7 @@ struct rdt_resource {
>>> enum resctrl_schema_fmt schema_fmt;
>>> unsigned int mbm_cfg_mask;
>>> bool cdp_capable;
>>> + struct dentry *arch_debug_info;
>>> };
>>
>> ok ... but maybe not quite exactly (more below)
>
> Would have been useful with the "always create directories" approach.
> As you point out below the name is problematic. Would need separate
> entries for control and monitor resources like RDT_RESOURCE_L3.
>
> I don't think it is useful in the "only make directories when requested
> by architecture" mode.
>
>>>
>>> /*
>>> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
>>> index ed4fc45da346..48c587201fb6 100644
>>> --- a/fs/resctrl/rdtgroup.c
>>> +++ b/fs/resctrl/rdtgroup.c
>>> @@ -4274,6 +4274,8 @@ void resctrl_offline_cpu(unsigned int cpu)
>>> */
>>> int resctrl_init(void)
>>> {
>>> + struct dentry *debuginfodir;
>>> + struct rdt_resource *r;
>>> int ret = 0;
>>>
>>> seq_buf_init(&last_cmd_status, last_cmd_status_buf,
>>> @@ -4320,6 +4322,12 @@ int resctrl_init(void)
>>> */
>>> debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
>>>
>>> + /* Create debug info directories for each resource */
>>> + debuginfodir = debugfs_create_dir("info", debugfs_resctrl);
>>> +
>>> + for_each_rdt_resource(r)
>>> + r->arch_debug_info = debugfs_create_dir(r->name, debuginfodir);
>>
>> This ignores (*) several of the boundaries my response aimed to establish.
>>
>> Here are some red flags:
>> - This creates the resource named directory and hands off that pointer to the
>> arch. As I mentioned the arch should not have control over resctrl's debugfs.
>> I believe this is the type of information that should be in control of resctrl fs
>> since, as I mentioned, resctrl fs may need to add debugging that mirrors /sys/fs/resctrl.
>> - Blindly creating these directories (a) without the resource even existing on the
>> system, and (b) without being used/requested by the architecture does not create a good
>> interface in my opinion. User space will see a bunch of empty directories
>> associated with resources that are not present on the system.
>> - The directories created do not even match /sys/fs/resctrl/info when it comes
>> to the resources. Note that the directories within /sys/fs/resctrl/info are created
>> from the schema for control resources and appends _MON to monitor resources. Like
>> I mentioned in my earlier response there should ideally be space for a future
>> resctrl fs extension to mirror layout of /sys/fs/resctrl for resctrl fs debug
>> in debugfs. This solution ignores all of that.
>>
>> I still think that the architecture should request the debugfs directory from resctrl fs.
>> This avoids resctrl fs needing to create directories/files that are never used and
>> does not present user space with an empty tree. Considering that the new PERF_PKG
>> resource may not come online until resctrl mount this should be something that can be
>> called at any time.
>>
>> One possibility, that supports intended use while keeping the door open to support
>> future resctrl fs use of the debugfs, could be a new resctrl fs function,
>> for example resctrl_create_mon_resource_debugfs(struct rdt_resource *r), that will initialize
>> rdt_resource::arch_debug_info(*) to point to the dentry of newly created
>> /sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON/arch_debug_name_TBD *if*
>> the associated resource is capable of monitoring ... or do you think an architecture
>> may want to add debugging information before a resource is discovered/enabled?
>> If doing this then rdt_resource::arch_debug_info is no longer appropriate since it needs
>> to be specific to the monitoring resource. Perhaps then rdt_resource::arch_mon_debugfs
>> that would eventually live in [1]?
>>
>> This is feeling rushed and I am sharing some top of mind ideas. I will give this
>> more thought.
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/cb8425c73f57280b0b4f22e089b2912eede42f7a.1747349530.git.babu.moger@amd.com/
>>
>> (*) I have now asked several times to stop ignoring feedback. This should not even
>> be necessary in the first place. I do not require you to agree with me and I do not claim
>> to always be right; please just stop ignoring feedback. Going forward I plan to ignore
>> messages that ignore feedback.
>
> So here's a second PoC. Takes into account all of the points you make
> above with the following adjustments:
>
> 1) Not adding the rdt_resource::arch_mon_debugfs field. Just returning
> the "struct dentry *" looks to be adequate for existing use case.
>
> Having the pointer in "struct resource" would be useful if some future
> use case needed to access the debugfs locations from calls to
> architecture code that pass in the rdt_resource pointer. Could be
> added if ever needed.
>
> 2) I can't envision a need for debugfs entries for resources
> pre-discovery, or when not enabled. So keep things simple for
> now.
>
> 3) I think the function name resctrl_debugfs_mon_info_mkdir() is a bit
> more descriptive (it is making a directory and we usually have such
> functions include "mkdir" in the name).
>
> -Tony
>
> ---
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8bec8f766b01..771e69c0c5c1 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -564,6 +564,12 @@ void resctrl_arch_reset_all_ctrls(struct rdt_resource *r);
> extern unsigned int resctrl_rmid_realloc_threshold;
> extern unsigned int resctrl_rmid_realloc_limit;
>
> +/**
> + * resctrl_debugfs_mon_info_mkdir() - Create a debugfs info directory.
> + * @r: Resource (must be mon_capable).
> + */
> +struct dentry *resctrl_debugfs_mon_info_mkdir(struct rdt_resource *r);
> +
> int resctrl_init(void);
> void resctrl_exit(void);
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 8d094a3acf2f..0f11b8d0ce0b 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -4344,6 +4344,22 @@ int resctrl_init(void)
> return ret;
> }
>
> +struct dentry *resctrl_debugfs_mon_info_mkdir(struct rdt_resource *r)
> +{
> + static struct dentry *debugfs_resctrl_info;
> + char name[32];
> +
> + if (!r->mon_capable)
> + return NULL;
> +
> + if (!debugfs_resctrl_info)
> + debugfs_resctrl_info = debugfs_create_dir("info", debugfs_resctrl);
> +
> + sprintf(name, "%s_MON", r->name);
> +
> + return debugfs_create_dir(name, debugfs_resctrl_info);
> +}
> +
> static bool resctrl_online_domains_exist(void)
> {
> struct rdt_resource *r;
Why do you keep insisting without motivation on handing control of what
should be resctrl fs managed directories to architecture? Twice have I suggested
that an arch private directory be created for the arch debugfs and every
time you create a patch without motivation where arch gets control of what
should be resctrl fs managed. Again, if my suggestions are flawed it is an
opportunity for a teaching moment, never to be ignored. I highlighted that
this is not ideal in the message you are responding to. I'm done.
Reinette
* RE: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-09 22:39 ` Reinette Chatre
@ 2025-06-09 23:34 ` Luck, Tony
2025-06-10 0:30 ` Reinette Chatre
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-09 23:34 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Keshavamurthy, Anil S,
Chen, Yu C, x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
Reinette,
Trimming to focus on why I was confused by your message.
>> One possibility, that supports intended use while keeping the door open to support
>> future resctrl fs use of the debugfs, could be a new resctrl fs function,
>> for example resctrl_create_mon_resource_debugfs(struct rdt_resource *r), that will initialize
>> rdt_resource::arch_debug_info(*) to point to the dentry of newly created
>> /sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON/arch_debug_name_TBD *if*
>> the associated resource is capable of monitoring
What exactly is this dentry pointing to? I was mistakenly under the impression that it was a directory.
Now I think that you intend it to be a single file with a name chosen by filesystem code.
Is that right?
If so, there needs to be "umode_t mode" and "struct file_operations *fops" arguments
for architecture to say whether this file is readable, writeable, and most importantly
to specify the architecture functions to be called when the user accesses this file.
With added "mode" and "fops" arguments this proposal meets my needs.
Choosing the exact string for the "arch_debug_name_TBD" file name that
will be given to any other users needs some thought. I was planning on
simply "status" since the information that I want to convey is read-only
status about each of the telemetry collection aggregators. But that feels
like it might be limiting if a future use includes any control options by
providing a writable file.
-Tony
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-09 23:34 ` Luck, Tony
@ 2025-06-10 0:30 ` Reinette Chatre
2025-06-10 18:48 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Reinette Chatre @ 2025-06-10 0:30 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Keshavamurthy, Anil S,
Chen, Yu C, x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
On 6/9/25 4:34 PM, Luck, Tony wrote:
> Reinette,
>
> Trimming to focus on why I was confused by your message.
>
>>> One possibility, that supports intended use while keeping the door open to support
>>> future resctrl fs use of the debugfs, could be a new resctrl fs function,
>>> for example resctrl_create_mon_resource_debugfs(struct rdt_resource *r), that will initialize
>>> rdt_resource::arch_debug_info(*) to point to the dentry of newly created
>>> /sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON/arch_debug_name_TBD *if*
>>> the associated resource is capable of monitoring
>
> What exactly is this dentry pointing to? I was mistakenly under the impression that it was a directory.
Yes, it has been directory since https://lore.kernel.org/lkml/9eb9a466-2895-405a-91f7-cda75e75f7ae@intel.com/
If your impression was indeed that it was a directory then why did your patch not
create a directory?
I am now going to repeat what I said in https://lore.kernel.org/lkml/9eb9a466-2895-405a-91f7-cda75e75f7ae@intel.com/
>
> Now I think that you intend it to be a single file with a name chosen by filesystem code.
>
> Is that right?
Not what I have been saying, no.
>
> If so, there needs to be "umode_t mode" and "struct file_operations *fops" arguments
> for architecture to say whether this file is readable, writeable, and most importantly
> to specify the architecture functions to be called when the user accesses this file.
>
> With added "mode" and "fops" arguments this proposal meets my needs.
>
> Choosing the exact string for the "arch_debug_name_TBD" file name that
This should be a directory, a directory owned by the arch where it can create
debug infrastructure required by arch. The directory name chosen and
assigned by resctrl fs, while arch has freedom to create more directories
and add files underneath it. Goal is to isolate all arch specific debug to
a known location.
Again, we need to prepare for resctrl fs to potentially use debugfs for its own
debug and when it does this the expectation is that the layout will mirror
/sys/fs/resctrl. Creating a directory /sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON
and then handing it off to the arch goes *against* this. It gives arch
control over a directory that should be owned by resctrl fs.
What I have been trying to propose is that resctrl fs create a directory
/sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON/arch_debug_name_TBD and hand
a dentry pointer to it to the arch where it can do what is needed to support its debugging needs.
Isn't this exactly what I wrote in the snippet above? Above you respond with the
statement that you were under the impression that it was a directory ... and then
send a patch that does something else. I am so confused. Gaslighting is
beneath you.
> will be given to any other users needs some thought. I was planning on
> simply "status" since the information that I want to convey is read-only
> status about each of the telemetry collection aggregators. But that feels
> like it might be limiting if a future use includes any control options by
> providing a writable file.
The file containing the debug information for this feature will be added within
the directory that we are talking about here.
Reinette
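For concreteness, a sketch of the layout being proposed here, using the PERF_PKG
resource added by this series as the example ("arch_debug_name_TBD" is the still
undecided directory name; everything above it stays owned by resctrl fs):

/sys/kernel/debug/resctrl/
└── info
    └── PERF_PKG_MON
        └── arch_debug_name_TBD   <- dentry handed to architecture code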
* Re: [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters
2025-06-06 16:56 ` Reinette Chatre
@ 2025-06-10 15:16 ` Dave Martin
2025-06-10 15:54 ` Luck, Tony
0 siblings, 1 reply; 90+ messages in thread
From: Dave Martin @ 2025-06-10 15:16 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 06, 2025 at 09:56:25AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 6/6/25 9:25 AM, Luck, Tony wrote:
> > On Tue, Jun 03, 2025 at 08:49:08PM -0700, Reinette Chatre wrote:
> >>> + sprintf(buf, "%0*llu", fp->decplaces, frac);
> >>
> >> I'm a bit confused here. I see fp->decplaces as the field width and the "0" indicates
> >> that the value is zero padded on the _left_. I interpret this to mean that, for example,
> >> if the value of frac is 42 then it will be printed as "0042". The fraction's value is modified
> >> (it is printed as "0.0042") and there are no trailing zeroes to remove. What am I missing?
> >
> > An example may help. Suppose architecture is providing 18 binary place
> > numbers, and delivers the value 0x60000 to be displayed. With 18 binary
> > places filesystem chooses 6 decimal places (I'll document the rationale
> > for this choice in comments in next version). In binary the value looks
> > like this:
> >
> > integer binary_places
> > 1 100000000000000000
> >
> > Running through my algorithm will end with "frac" = 500000 (decimal).
> >
> > Thus there are *trailing* zeroes. The value should be displayed as
> > "1.5" not as "1.500000".
>
> Instead of a counter example, could you please make it obvious through
> the algorithm description and/or explanation of decimal place choice how
> "frac" is guaranteed to never be smaller than "decplaces"?
>
> Reinette
Trying to circumvent this...
Why do these conversions need to be done in the kernel at all?
Can't we just tell userspace the scaling factor and expose the
parameter as an integer?
In your example, this above value would be exposed as
0b110_0000_0000_0000_0000 / 0b100_0000_0000_0000_0000
(= 0x60000 / 0x40000)
This has the advantage that the data exchanged with userspace is exact,
(so far as the hardware permits, anyway) and there is no unnecessary
cost or complexity in the kernel.
Since userspace is probably some kind of scripting language, it can do
scaling conversions and pretty-print tables more straightforwardly
than the kernel can -- if it wants to. But it can also work in the
native representation with no introduction of rounding errors, and do
conversions only when necessary rather than every time a value crosses
the user/kernel boundary...
Cheers
---Dave
* RE: [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters
2025-06-10 15:16 ` Dave Martin
@ 2025-06-10 15:54 ` Luck, Tony
2025-06-12 16:19 ` Dave Martin
0 siblings, 1 reply; 90+ messages in thread
From: Luck, Tony @ 2025-06-10 15:54 UTC (permalink / raw)
To: Dave Martin, Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Keshavamurthy, Anil S, Chen, Yu C,
x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
> Trying to circumvent this...
>
> Why do these conversions need to be done in the kernel at all?
>
> Can't we just tell userspace the scaling factor and expose the
> parameter as an integer?
>
> In your example, this above value would be exposed as
>
> 0b110_0000_0000_0000_0000 / 0b100_0000_0000_0000_0000
>
> (= 0x60000 / 0x40000)
>
> This has the advantage that the data exchanged with userspace is exact,
> (so far as the hardware permits, anyway) and there is no unnecessary
> cost or complexity in the kernel.
>
> Since userspace is probably some kind of scripting language, it can do
> scaling conversions and pretty-print tables more straightforwardly
> than the kernel can -- if it wants to. But it can also work in the
> native representation with no introduction of rounding errors, and do
> conversions only when necessary rather than every time a value crosses
> the user/kernel boundary...
It seems user hostile to print 8974832975 with some info file to explain that
the scaling factor is 262144. While it may be common to read using some
special tool, it makes life harder for casual scripts.
Printing that value as 34236.270809 makes it simple for all tools.
The rounding error from the kernel is insignificant ("true" value is
34236.270809173583984375 ... so the error is around five parts
per trillion).
Things are worse sampling the Joule values once per-second to convert
to Watts. But even there the rounding errors from a 1-Watt workload
are tiny. Worst case you see 0.999999 followed by 2.000001 one second
later and report as 1.000002 Watts instead of 1.0
The error bars on the values computed by hardware are enormously
larger than this. Further compounded by the telemetry update rate
of 2 millliseconds. Errors from uncertainty in when the value was
captured are also orders of magnitude higher than kernel rounding
errors.
-Tony
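A minimal standalone sketch (userspace C) of the conversion being discussed,
using the example values from this thread. The 18 binary places / 6 decimal
places choice comes from the messages above; fmt_fixed_point() and the
trailing-zero trim are illustrative rather than the exact patch code:

/* Sketch only: demonstrates the fixed point -> decimal string conversion. */
#include <stdio.h>

static void fmt_fixed_point(unsigned long long val, unsigned int binplaces,
			    unsigned int decplaces, char *buf)
{
	unsigned long long ipart = val >> binplaces;
	unsigned long long frac = val & ((1ULL << binplaces) - 1);
	unsigned long long scale = 1;
	char fbuf[32];
	int len;

	for (unsigned int i = 0; i < decplaces; i++)
		scale *= 10;

	/* Scale the binary fraction up to 'decplaces' decimal digits */
	frac = (frac * scale) >> binplaces;

	/* Zero pad on the left so e.g. frac == 42 prints as ".000042" */
	len = sprintf(fbuf, "%0*llu", (int)decplaces, frac);

	/* Trim trailing zeroes so 1.500000 is shown as 1.5 */
	while (len > 1 && fbuf[len - 1] == '0')
		fbuf[--len] = '\0';

	sprintf(buf, "%llu.%s", ipart, fbuf);
}

int main(void)
{
	char buf[64];

	fmt_fixed_point(0x60000, 18, 6, buf);		/* prints 1.5 */
	printf("%s\n", buf);
	fmt_fixed_point(8974832975ULL, 18, 6, buf);	/* prints 34236.270809 */
	printf("%s\n", buf);
	return 0;
}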
* Re: [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file
2025-06-10 0:30 ` Reinette Chatre
@ 2025-06-10 18:48 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-06-10 18:48 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Keshavamurthy, Anil S,
Chen, Yu C, x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
On Mon, Jun 09, 2025 at 05:30:34PM -0700, Reinette Chatre wrote:
> This should be a directory, a directory owned by the arch where it can create
> debug infrastructure required by arch. The directory name chosen and
> assigned by resctrl fs, while arch has freedom to create more directories
> and add files underneath it. Goal is to isolate all arch specific debug to
> a known location.
>
> Again, we need to prepare for resctrl fs to potentially use debugfs for its own
> debug and when it does this the expectation is that the layout will mirror
> /sys/fs/resctrl. Creating a directory /sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON
> and then handing it off to the arch goes *against* this. It gives arch
> control over a directory that should be owned by resctrl fs.
>
> What I have been trying to propose is that resctrl fs create a directory
> /sys/kernel/debug/resctrl/info/<rdt_resource::name>_MON/arch_debug_name_TBD and hand
> a dentry pointer to it to the arch where it can do what is needed to support its debugging needs.
> Isn't this exactly what I wrote in the snippet above? Above you respond with the
> statement that you were under the impression that it was a directory ... and then
> send a patch that does something else. I am so confused. Gaslighting is
> beneath you.
For the precise name of the "arch_debug_name_TBD" directory, is simply "arch"
sufficient? That leaves every other name free for resctrl filesystem
code to choose from if it does add some debug files here.
Or would $ARCH ("x86" in my case) be better to keep distinct debug name
spaces between architectures?
-Tony
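Pulling this sub-thread together, a minimal sketch of what the split could look
like if the arch-owned directory ends up being named "arch". The fs-side helper
follows the earlier PoC; intel_aet_debugfs_init() and status_fops are
hypothetical names for the arch side (needs <linux/debugfs.h>):

/* fs/resctrl side (sketch): resctrl fs owns "info" and "<resource>_MON",
 * only the "arch" sub-directory dentry is handed to architecture code.
 */
struct dentry *resctrl_debugfs_mon_info_mkdir(struct rdt_resource *r)
{
	static struct dentry *info_dir;
	struct dentry *mon_dir;
	char name[32];

	if (!r->mon_capable)
		return NULL;

	if (!info_dir)
		info_dir = debugfs_create_dir("info", debugfs_resctrl);

	snprintf(name, sizeof(name), "%s_MON", r->name);
	mon_dir = debugfs_create_dir(name, info_dir);

	/* resctrl fs could keep mon_dir for its own future debug files */

	/* "arch" is one of the names floated above, not a final choice */
	return debugfs_create_dir("arch", mon_dir);
}

/* Architecture side (sketch): hypothetical init that adds a read-only
 * "status" file beneath the directory handed back by resctrl fs.
 * status_fops would be defined by the arch, e.g. via DEFINE_SHOW_ATTRIBUTE().
 */
static void intel_aet_debugfs_init(struct rdt_resource *r)
{
	struct dentry *dir = resctrl_debugfs_mon_info_mkdir(r);

	if (!dir)
		return;

	debugfs_create_file("status", 0444, dir, r, &status_fops);
}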
* Re: [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters
2025-06-10 15:54 ` Luck, Tony
@ 2025-06-12 16:19 ` Dave Martin
0 siblings, 0 replies; 90+ messages in thread
From: Dave Martin @ 2025-06-12 16:19 UTC (permalink / raw)
To: Luck, Tony
Cc: Chatre, Reinette, Fenghua Yu, Wieczor-Retman, Maciej,
Peter Newman, James Morse, Babu Moger, Drew Fustini,
Keshavamurthy, Anil S, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
Hi,
On Tue, Jun 10, 2025 at 03:54:35PM +0000, Luck, Tony wrote:
> > Trying to circumvent this...
> >
> > Why do these conversions need to be done in the kernel at all?
> >
> > Can't we just tell userspace the scaling factor and expose the
> > parameter as an integer?
> >
> > In your example, this above value would be exposed as
> >
> > 0b110_0000_0000_0000_0000 / 0b100_0000_0000_0000_0000
> >
> > (= 0x60000 / 0x40000)
> >
> > This has the advantage that the data exchanged with userspace is exact,
> > (so far as the hardware permits, anyway) and there is no unnecessary
> > cost or complexity in the kernel.
> >
> > Since userspace is probably some kind of scripting language, it can do
> > scaling conversions and pretty-print tables more straightforwardly
> > than the kernel can -- if it wants to. But it can also work in the
> > native representation with no introduction of rounding errors, and do
> > conversions only when necessary rather than every time a value crosses
> > the user/kernel boundary...
>
> It seems user hostile to print 8974832975 with some info file to explain that
> the scaling factor is 262144. While it may be common to read using some
> special tool, it makes life harder for casual scripts.
>
> Printing that value as 34236.270809 makes it simple for all tools.
The divisor is going to be a power of two or a power of ten in
practice, and I think most technical users are fairly used to looking
at values scaled by those -- so I'm not convinced that this is quite as
bad as you suggest.
The choice of unit in the interface is still arbitrary, and the kernel
is already inconsistent with itself in this area, so I think userspace
is often going to have to do some scaling conversions anyway.
resctrl is not (necessarily) a user interface, but I agree that it is
no bad thing to make the output eyeball-friendly, so long as the cost
of doing so is reasonable.
(Plenty of virtual "text" files exported by the kernel are extremely
cryptic and user-hostile, though.)
> The rounding error from the kernel is insignificant ("true" value is
> 34236.270809173583984375 ... so the error is around five parts
> per trillion).
>
> Things are worse sampling the Joule values once per-second to convert
> to Watts. But even there the rounding errors from a 1-Watt workload
> are tiny. Worst case you see 0.999999 followed by 2.000001 one second
> later and report as 1.000002 Watts instead of 1.0
>
> The error bars on the values computed by hardware are enormously
> larger than this. Further compounded by the telemetry update rate
> of 2 milliseconds. Errors from uncertainty in when the value was
> captured are also orders of magnitude higher than kernel rounding
> errors.
>
> -Tony
If we can make the intermediate interface error-free by construction
and without making life especially hard for anyone, then that means we
can bolt whatever on at each end without having to even think about the
effect on accuracy.
I agree though that the inaccuracies introduced by the interface will
be very marginal, and likely swamped by hardware limitations and timing
skid.
Either way, it's not my call...
Cheers
---Dave
* Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
` (29 preceding siblings ...)
2025-05-28 17:21 ` [PATCH v5 00/29] x86/resctrl telemetry monitoring Reinette Chatre
@ 2025-06-13 16:57 ` James Morse
2025-06-13 18:50 ` Luck, Tony
30 siblings, 1 reply; 90+ messages in thread
From: James Morse @ 2025-06-13 16:57 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman,
Peter Newman, Babu Moger, Drew Fustini, Dave Martin,
Anil Keshavamurthy, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
I'm still going through this, but here is my attempt to describe what equivalents arm has
in this area.
On 21/05/2025 23:50, Tony Luck wrote:
> Background
> ----------
>
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.
This is a shared memory area where an external agent (the OOBMSM) has logged measurement data?
Arm's equivalent to this is two things.
For things close to the CPU (e.g. stalls_llc_miss) these would be a PMU (possibly uncore
PMU) which follow a convention on their register layout meaning the general purpose pmu
driver should be able to drive them. The meaning of the events is described to user-space
via the perf json file. The kernel knows how to read event:6, but not what event:6 means.
The spec for this mentions MPAM, values can be monitored by ~RMID, but none of this is
managed by the MPAM driver.
The other thing arm has that is a bit like this is SCMI, which is a packet format for
talking to an on-die microcontroller to get platform specific temperature, voltage and
clock values. Again, this is another bit of kernel infrastructure that has its own way of
doing things. I don't see this filtering things by ~RMID ... but I guess it's possible.
That can have shared memory areas (termed 'fast channels'). I think they are an array of
counter values, and something in the packet stream tells you which one is which.
Neither of these need picking up by the MPAM driver to expose via resctrl. But I'd like to
get that information across where possible so that user-space can be portable.
> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package. In the initial
> hardware implementation, there are two categories of events: energy
> and perf.
>
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.
>
> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).
> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with much lower overhead (no need to collect data
> on every context switch).
>
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC
>
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks
>
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles
>
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches
>
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
>
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
>
> uops_retired - Counts the total number of uops retired
>
> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
>
> Offset: Counter
> 0x00 core energy for RMID 0
> 0x08 core activity for RMID 0
> 0x10 core energy for RMID 1
> 0x18 core activity for RMID 1
> ...
For the performance counters especially, on arm I'd be trying to get these values by
teaching perf about the CLOSID/RMID values, so that perf events are only incremented for
tasks in a particular control/monitor group.
(why that might be relevant is below)
> Resctrl User Interface
> ----------------------
>
> Because there may be multiple OOBMSM collection agents per processor
> package, resctrl accumulates event counts from all agents on a package
> and presents a single value to users. This will provide a consistent
> user interface on future platforms that vary the number of collectors,
> or the mappings from logical CPUs to collectors.
Great!
> Users will continue to see the legacy monitoring files in the "L3"
> directories and the telemetry files in the new "PERF_PKG" directories
> (with each file providing the aggregated value from all OOBMSM collectors
> on that package).
>
> $ tree /sys/fs/resctrl/mon_data/
> /sys/fs/resctrl/mon_data/
> ├── mon_L3_00
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_L3_01
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_PERF_PKG_00
Where do the package ids come from? How can user-space find out which CPUs are in package-0?
I don't see a package_id in either /sys/devices/system/cpu/cpu0/topology or
Documentation/ABI/stable/sysfs-devices-system-cpu.
> │ ├── activity
> │ ├── c1_res
> │ ├── c6_res
> │ ├── core_energy
> │ ├── stalls_llc_hit
> │ ├── stalls_llc_miss
> │ ├── unhalted_core_cycles
> │ ├── unhalted_ref_cycles
> │ └── uops_retired
> └── mon_PERF_PKG_01
> ├── activity
> ├── c1_res
> ├── c6_res
> ├── core_energy
> ├── stalls_llc_hit
> ├── stalls_llc_miss
> ├── unhalted_core_cycles
> ├── unhalted_ref_cycles
> └── uops_retired
Looks good to me.
The difficulty MPAM platforms have had with mbm_total_bytes et al is the "starts counting
from the beginning of time" property. Having to enable mbm_total_bytes before it counts
would have allowed MPAM to report an error if it couldn't enable more than N counters at a
time. (ABMC suggests AMD platforms have a similar problem).
How do you feel about having to enable these before they start counting?
This would allow the MPAM driver to open the event via perf if it has a corresponding
feature/counter, then provide the value from perf via resctrl.
Another headache is how we describe the format of the contents of these files... a made
up example: residency counts could be in absolute time, or percentages. I've been bitten
by the existing schemata strings being implicitly in a particular format, meaning
conversions have to happen. I'm not sure whether some architecture/platform would trip
over the same problem here.
> Resctrl Implementation
> ----------------------
>
> The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
> that returns an array of structures describing the per-RMID groups it
> found from the VSEC enumeration. Linux looks at the unique identifiers
> for each group and enables resctrl for all groups with known unique
> identifiers.
>
> The memory map for the counters for each <RMID, event> pair is described
> by the XML file. This is too unwieldy to use in the Linux kernel, so a
> simplified representation is built into the resctrl code.
(I hope there are only a few combinations!)
> Note that the
> counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
> and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> calls to read counters from a CPU in a specific domain.
Huzzah! RISC-V has this property, and many MPAM platforms do, (...but not all...)
> The counters can be read from any CPU.
>
> High level description of code changes:
>
> 1) New scope RESCTRL_PACKAGE
> 2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
> 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
> switch (r->rid) {
> case RDT_RESOURCE_L3:
> helper for L3
> break;
> case RDT_RESOURCE_PERF_PKG:
> helper for PKG
> break;
> }
> 4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.
>
> With only one platform providing this feature, it's tricky to tell
> exactly where it is going to go. I've made the event definitions
> platform specific (based on the unique ID from the VSEC enumeration). It
> seems possible/likely that the list of events may change from generation
> to generation.
My thinking about this from a perf angle was to have named events for those things that
resctrl supports, but allow events to be specified by number, and funnel those through
resctrl_arch_rmid_read() so that the arch code can interpret them as a counter type-id or
an offset in some array. The idea was to allow platform specific counters to be read
without any kernel changes, reducing the pressure to add resctrl support for counters that
may only ever be present in a single platform.
With the XML data you have, would it be possible to add new 'events' to this interface via
sysfs/configfs? Or does too much depend on the data identified by that GUID...
Thanks,
James
* Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring
2025-06-13 16:57 ` James Morse
@ 2025-06-13 18:50 ` Luck, Tony
0 siblings, 0 replies; 90+ messages in thread
From: Luck, Tony @ 2025-06-13 18:50 UTC (permalink / raw)
To: James Morse
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
Babu Moger, Drew Fustini, Dave Martin, Anil Keshavamurthy,
Chen Yu, x86, linux-kernel, patches
On Fri, Jun 13, 2025 at 05:57:26PM +0100, James Morse wrote:
> Hi Tony,
>
> I'm still going through this, but here is my attempt to describe what equivalents arm has
> in this area.
>
>
> On 21/05/2025 23:50, Tony Luck wrote:
> > Background
> > ----------
> >
> > Telemetry features are being implemented in conjunction with the
> > IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> > counts for various events to a collector in a nearby OOBMSM device to be
> > accumulated with counts for each <RMID, event> pair received from other
> > CPUs. Cores send event counts when the RMID value changes, or after each
> > 2ms elapsed time.
>
> This is a shared memory area where an external agent (the OOBMSM) has logged measurement data?
Yes. Effectively shared memory (but in another address space so need to
use readq(addr) rather than just *addr to read values to keep sparse
happy).
> Arm's equivalent to this is two things.
> For things close to the CPU (e.g. stalls_llc_miss) these would be a PMU (possibly uncore
> PMU) which follow a convention on their register layout meaning the general purpose pmu
> driver should be able to drive them. The meaning of the events is described to user-space
> via the perf json file. The kernel knows how to read event:6, but not what event:6 means.
> The spec for this mentions MPAM, values can be monitored by ~RMID, but none of this is
> managed by the MPAM driver.
>
> The other thing arm has that is a bit like this is SCMI, which is a packet format for
> talking to an on-die microcontroller to get platform specific temperature, voltage and
> clock values. Again, this is another bit of kernel infrastructure that has its own way of
> doing things. I don't see this filtering things by ~RMID ... but I guess it's possible.
> That can have shared memory areas (termed 'fast channels'). I think they are an array of
> counter values, and something in the packet stream tells you which one is which.
>
>
> Neither of these need picking up by the MPAM driver to expose via resctrl. But I'd like to
> get that information across where possible so that user-space can be portable.
>
>
> > Each OOBMSM device may implement multiple event collectors with each
> > servicing a subset of the logical CPUs on a package. In the initial
> > hardware implementation, there are two categories of events: energy
> > and perf.
> >
> > 1) Energy - Two counters
> > core_energy: This is an estimate of Joules consumed by each core. It is
> > calculated based on the types of instructions executed, not from a power
> > meter. This counter is useful to understand how much energy a workload
> > is consuming.
> >
> > activity: This measures "accumulated dynamic capacitance". Users who
> > want to optimize energy consumption for a workload may use this rather
> > than core_energy because it provides consistent results independent of
> > any frequency or voltage changes that may occur during the runtime of
> > the application (e.g. entry/exit from turbo mode).
>
> > 2) Performance - Seven counters
> > These are similar events to those available via the Linux "perf" tool,
> > but collected in a way with much lower overhead (no need to collect data
> > on every context switch).
> >
> > stalls_llc_hit - Counts the total number of unhalted core clock cycles
> > when the core is stalled due to a demand load miss which hit in the LLC
> >
> > c1_res - Counts the total C1 residency across all cores. The underlying
> > counter increments on 100MHz clock ticks
> >
> > unhalted_core_cycles - Counts the total number of unhalted core clock
> > cycles
> >
> > stalls_llc_miss - Counts the total number of unhalted core clock cycles
> > when the core is stalled due to a demand load miss which missed all the
> > local caches
> >
> > c6_res - Counts the total C6 residency. The underlying counter increments
> > on crystal clock (25MHz) ticks
> >
> > unhalted_ref_cycles - Counts the total number of unhalted reference clock
> > (TSC) cycles
> >
> > uops_retired - Counts the total number of uops retired
> >
> > The counters are arranged in groups in MMIO space of the OOBMSM device.
> > E.g. for the energy counters the layout is:
> >
> > Offset: Counter
> > 0x00 core energy for RMID 0
> > 0x08 core activity for RMID 0
> > 0x10 core energy for RMID 1
> > 0x18 core activity for RMID 1
> > ...
>
> For the performance counters especially, on arm I'd be trying to get these values by
> teaching perf about the CLOSID/RMID values, so that perf events are only incremented for
> tasks in a particular control/monitor group.
> (why that might be relevant is below)
Yes. If perf is enhanced to take CLOSID/RMID into account when
accumulating event counts it can provide the same functionality.
Higher overhead since perf needs to sample event counters of
interest on every context switch instead of data collection
being handled by hardware.
On the other hand the perf approach is more flexible as you can
pick any event to sample per-RMID instead of the fixed set that
the h/w designer chose.
>
> > Resctrl User Interface
> > ----------------------
> >
> > Because there may be multiple OOBMSM collection agents per processor
> > package, resctrl accumulates event counts from all agents on a package
> > and presents a single value to users. This will provide a consistent
> > user interface on future platforms that vary the number of collectors,
> > or the mappings from logical CPUs to collectors.
>
> Great!
>
>
> > Users will continue to see the legacy monitoring files in the "L3"
> > directories and the telemetry files in the new "PERF_PKG" directories
> > (with each file providing the aggregated value from all OOBMSM collectors
> > on that package).
> >
> > $ tree /sys/fs/resctrl/mon_data/
> > /sys/fs/resctrl/mon_data/
> > ├── mon_L3_00
> > │ ├── llc_occupancy
> > │ ├── mbm_local_bytes
> > │ └── mbm_total_bytes
> > ├── mon_L3_01
> > │ ├── llc_occupancy
> > │ ├── mbm_local_bytes
> > │ └── mbm_total_bytes
>
> > ├── mon_PERF_PKG_00
>
> Where do the package ids come from? How can user-space find out which CPUs are in package-0?
Resctrl gets the id from topology_physical_package_id(cpu);
>
> I don't see a package_id in either /sys/devices/system/cpu/cpu0/topology or
> Documentation/ABI/stable/sysfs-devices-system-cpu.
These package IDs show up on x86 with these file names:
$ grep ^ /sys/devices/system/cpu/cpu0/topology/*package*
/sys/devices/system/cpu/cpu0/topology/package_cpus:0000,00000fff,ffffff00,0000000f,ffffffff
/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-35,72-107
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
>
> > │ ├── activity
> > │ ├── c1_res
> > │ ├── c6_res
> > │ ├── core_energy
> > │ ├── stalls_llc_hit
> > │ ├── stalls_llc_miss
> > │ ├── unhalted_core_cycles
> > │ ├── unhalted_ref_cycles
> > │ └── uops_retired
> > └── mon_PERF_PKG_01
> > ├── activity
> > ├── c1_res
> > ├── c6_res
> > ├── core_energy
> > ├── stalls_llc_hit
> > ├── stalls_llc_miss
> > ├── unhalted_core_cycles
> > ├── unhalted_ref_cycles
> > └── uops_retired
>
> Looks good to me.
>
> The difficulty MPAM platforms have had with mbm_total_bytes et al is the "starts counting
> from the beginning of time" property. Having to enable mbm_total_bytes before it counts
> would have allowed MPAM to report an error if it couldn't enable more than N counters at a
> time. (ABMC suggests AMD platforms have a similar problem).
Resctrl goes to some lengths to have mbm_total_bytes start from zero
when you mkdir a group even when some old RMID is re-used that has
got some left over value from its previous lifetime. This isn't
overly painful because resctrl has to carry lots of per-RMID state
to handle the wraparound of the narrow counters.
The Intel telemetry counters are 63 bits (lose one bit for the VALID
indication). So wrap around is no concern at all for most of them
as it happens in centuries/millennia. Potentially the uops_retired
counter might wrap in months, but that only happens if every logical
CPU is running with the same RMID for that whole time. So I've chosen
to ignore wraparound. As a result counters don't start from zero when
a group is created. I don't see this as an issue because all use cases
are "read a counter; wait some interval; re-read the counter; compute
the rate" which doesn't require starting from zero.
>
> How do you feel about having to enable these before they start counting?
>
> This would allow the MPAM driver to open the event via perf if it has a corresponding
> feature/counter, then provide the value from perf via resctrl.
You'd have resctrl report "Unavailable" for these until connecting the
plumbing to perf to provide data?
>
> Another headache is how we describe the format of the contents of these files... a made
> up example: residency counts could be in absolute time, or percentages. I've been bitten
> by the existing schemata strings being implicitly in a particular format, meaning
> conversions have to happen. I'm not sure whether some architecture/platform would trip
> over the same problem here.
Reinette is adamant that format of each resctrl event file must be
fixed. So if different systems report residency in different ways,
you'd either have to convert to some common format, or if that isn't
possible, those would have to appear in resctrl as different filenames.
E.g. "residency_absolute" and "residency_percentage".
>
> > Resctrl Implementation
> > ----------------------
> >
> > The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
> > that returns an array of structures describing the per-RMID groups it
> > found from the VSEC enumeration. Linux looks at the unique identifiers
> > for each group and enables resctrl for all groups with known unique
> > identifiers.
> >
> > The memory map for the counters for each <RMID, event> pair is described
> > by the XML file. This is too unwieldy to use in the Linux kernel, so a
> > simplified representation is built into the resctrl code.
>
> (I hope there are only a few combinations!)
Almost certain to have a new description for each CPU generation since
the number of RMIDs is embedded in the description. If the overall
structure stays the same, then each new instance is described by a
dozen or so lines of code to initialize a data structure, so I think
an acceptable level of pain.
>
> > Note that the
> > counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
> > and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> > calls to read counters from a CPU in a specific domain.
>
> Huzzah! RISC-V has this property, and many MPAM platforms do, (...but not all...)
>
>
> > The counters can be read from any CPU.
> >
> > High level description of code changes:
> >
> > 1) New scope RESCTRL_PACKAGE
> > 2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
> > 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
> > switch (r->rid) {
> > case RDT_RESOURCE_L3:
> > helper for L3
> > break;
> > case RDT_RESOURCE_PERF_PKG:
> > helper for PKG
> > break;
> > }
> > 4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.
> >
> > With only one platform providing this feature, it's tricky to tell
> > exactly where it is going to go. I've made the event definitions
> > platform specific (based on the unique ID from the VSEC enumeration). It
> > seems possible/likely that the list of events may change from generation
> > to generation.
>
> My thinking about this from a perf angle was to have named events for those things that
> resctrl supports, but allow events to be specified by number, and funnel those through
> resctrl_arch_rmid_read() so that the arch code can interpret them as a counter type-id or
> an offset in some array. The idea was to allow platform specific counters to be read
> without any kernel changes, reducing the pressure to add resctrl support for counters that
> may only ever be present in a single platform.
>
> With the XML data you have, would it be possible to add new 'events' to this interface via
> sysfs/configfs? Or does too much depend on the data identified by that GUID...
Maybe. There are a number of parameters that need to be provided for an
event:
1) Scope. Must be something that resctrl already knows (and is actively
using?) L2/L3 cache, node, package.
2) Base address(es) for counters for this event.
3) Parameters for F(RMID) to compute offset from base
4) Type of access (mmio_read, MSR_read, other?)
5) post-process needed after read (check for valid? maybe other things)
Might be best to try some PoC implementation to see which of those are
required vs. overkill. Also to see what I missed from the list.
>
>
> Thanks,
>
> James
-Tony
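As a rough illustration of items 2) through 4) in the list above, using the
fixed energy-group layout quoted earlier in the thread (two 8-byte counters per
RMID), the F(RMID) computation and MMIO access could look like the sketch
below. The struct and function names are made up for illustration, not taken
from the series:

#include <linux/types.h>
#include <linux/io.h>

struct telem_evt_group {
	void __iomem	*base;			/* start of this aggregator's counter array */
	unsigned int	counters_per_rmid;	/* 2 for the energy group */
};

/* F(RMID): offset = (rmid * counters_per_rmid + event index) * 8 bytes */
static u64 telem_read_counter(struct telem_evt_group *g, u32 rmid, u32 evt_idx)
{
	size_t offset = ((size_t)rmid * g->counters_per_rmid + evt_idx) * sizeof(u64);

	/* e.g. rmid == 1, evt_idx == 1 -> offset 0x18: core activity for RMID 1 */

	/* Counters live in MMIO space, so use readq() rather than a plain load */
	return readq(g->base + offset);
}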
Thread overview: 90+ messages
2025-05-21 22:50 [PATCH v5 00/29] x86/resctrl telemetry monitoring Tony Luck
2025-05-21 22:50 ` [PATCH v5 01/29] x86,fs/resctrl: Consolidate monitor event descriptions Tony Luck
2025-06-04 3:25 ` Reinette Chatre
2025-06-04 16:33 ` Luck, Tony
2025-06-04 18:24 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 02/29] x86,fs/resctrl: Replace architecture event enabled checks Tony Luck
2025-06-04 3:26 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 03/29] x86/resctrl: Remove 'rdt_mon_features' global variable Tony Luck
2025-06-04 3:27 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 04/29] x86,fs/resctrl: Prepare for more monitor events Tony Luck
2025-05-23 9:00 ` Peter Newman
2025-05-23 15:57 ` Luck, Tony
2025-06-04 3:29 ` Reinette Chatre
2025-06-07 0:45 ` Fenghua Yu
2025-06-08 21:59 ` Luck, Tony
2025-05-21 22:50 ` [PATCH v5 05/29] x86/rectrl: Fake OOBMSM interface Tony Luck
2025-05-23 23:38 ` Reinette Chatre
2025-05-27 20:25 ` [PATCH v5 05/29 UPDATED] x86/resctrl: " Tony Luck
2025-05-21 22:50 ` [PATCH v5 06/29] x86,fs/resctrl: Improve domain type checking Tony Luck
2025-06-04 3:31 ` Reinette Chatre
2025-06-04 22:58 ` Luck, Tony
2025-06-04 23:40 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 07/29] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
2025-06-04 3:32 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 08/29] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
2025-05-21 22:50 ` [PATCH v5 09/29] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
2025-06-04 3:32 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 10/29] x86/resctrl: Change generic domain functions to use struct rdt_domain_hdr Tony Luck
2025-05-22 0:01 ` Keshavamurthy, Anil S
2025-05-22 0:15 ` Luck, Tony
2025-06-04 3:37 ` Reinette Chatre
2025-06-07 0:52 ` Fenghua Yu
2025-06-08 22:02 ` Luck, Tony
2025-05-21 22:50 ` [PATCH v5 11/29] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
2025-06-04 3:40 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 12/29] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
2025-05-21 22:50 ` [PATCH v5 13/29] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
2025-06-04 3:42 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 14/29] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
2025-06-04 3:49 ` Reinette Chatre
2025-06-06 16:25 ` Luck, Tony
2025-06-06 16:56 ` Reinette Chatre
2025-06-10 15:16 ` Dave Martin
2025-06-10 15:54 ` Luck, Tony
2025-06-12 16:19 ` Dave Martin
2025-05-21 22:50 ` [PATCH v5 15/29] fs/resctrl: Add an architectural hook called for each mount Tony Luck
2025-06-04 3:49 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 16/29] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
2025-05-21 22:50 ` [PATCH v5 17/29] x86/resctrl: Discover hardware telemetry events Tony Luck
2025-06-04 3:53 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 18/29] x86/resctrl: Count valid telemetry aggregators per package Tony Luck
2025-06-04 3:54 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 19/29] x86/resctrl: Complete telemetry event enumeration Tony Luck
2025-06-04 4:05 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 20/29] x86,fs/resctrl: Fill in details of Clearwater Forest events Tony Luck
2025-06-04 3:57 ` Reinette Chatre
2025-06-07 0:57 ` Fenghua Yu
2025-06-08 22:05 ` Luck, Tony
2025-05-21 22:50 ` [PATCH v5 21/29] x86/resctrl: x86/resctrl: Read core telemetry events Tony Luck
2025-06-04 4:02 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 22/29] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
2025-06-04 4:06 ` Reinette Chatre
2025-06-07 0:54 ` Fenghua Yu
2025-06-08 22:03 ` Luck, Tony
2025-05-21 22:50 ` [PATCH v5 23/29] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
2025-05-21 22:50 ` [PATCH v5 24/29] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
2025-06-04 4:10 ` Reinette Chatre
2025-06-06 23:55 ` Fenghua Yu
2025-06-08 21:52 ` Luck, Tony
2025-05-21 22:50 ` [PATCH v5 25/29] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
2025-06-04 4:13 ` Reinette Chatre
2025-05-21 22:50 ` [PATCH v5 26/29] x86,fs/resctrl: Move RMID initialization to first mount Tony Luck
2025-05-21 22:50 ` [PATCH v5 27/29] fs/resctrl: Add file system mechanism for architecture info file Tony Luck
2025-06-04 4:15 ` Reinette Chatre
2025-06-06 0:09 ` Luck, Tony
2025-06-06 16:26 ` Reinette Chatre
2025-06-06 17:30 ` Luck, Tony
2025-06-06 21:14 ` Reinette Chatre
2025-06-09 18:49 ` Luck, Tony
2025-06-09 22:39 ` Reinette Chatre
2025-06-09 23:34 ` Luck, Tony
2025-06-10 0:30 ` Reinette Chatre
2025-06-10 18:48 ` Luck, Tony
2025-05-21 22:50 ` [PATCH v5 28/29] x86/resctrl: Add info/PERF_PKG_MON/status file Tony Luck
2025-05-21 22:50 ` [PATCH v5 29/29] x86/resctrl: Update Documentation for package events Tony Luck
2025-05-28 17:21 ` [PATCH v5 00/29] x86/resctrl telemetry monitoring Reinette Chatre
2025-05-28 21:38 ` Luck, Tony
2025-05-28 22:21 ` Reinette Chatre
2025-06-13 16:57 ` James Morse
2025-06-13 18:50 ` Luck, Tony