From: Tony Luck <tony.luck@intel.com>
To: Fenghua Yu <fenghuay@nvidia.com>,
	Reinette Chatre <reinette.chatre@intel.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	Drew Fustini <dfustini@baylibre.com>,
	Dave Martin <Dave.Martin@arm.com>,
	Anil Keshavamurthy <anil.s.keshavamurthy@intel.com>,
	Chen Yu <yu.c.chen@intel.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev, Tony Luck <tony.luck@intel.com>
Subject: [PATCH v4 00/31] x86/resctrl telemetry monitoring
Date: Mon, 28 Apr 2025 17:33:26 -0700
Message-ID: <20250429003359.375508-1-tony.luck@intel.com>

These patches are based on James Morse's latest patch set to:

  "Move the resctrl filesystem code to /fs/resctrl"

posted here:

Link: https://lore.kernel.org/all/20250425173809.5529-1-james.morse@arm.com/

Also available in the "mpam/move_to_fs/v9_final" branch of:
Link: git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git

I've pushed a combination of James' series plus these patches to the
rdt-aet-v4 branch at:

Link: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git

Extensive changes (based on feedback from Reinette) since v3, which was
posted here:

Link: https://lore.kernel.org/all/20250407234032.241215-1-tony.luck@intel.com/

Major changes:

1) Instead of using bits in the architecture local "rdt_mon_features"
variable to keep track of enabled monitor events, use mon_evt::enabled
to track at the file system layer. Architecture informs file system
which events are enabled. This means that file system no longer needs
any of the resctrl_arch_is_*_enabled() calls to architecture as it now
has the array of mon_evt structures to check. This is one step in making
the mon_evt structure the source of all information about each event.

2) Split the v3 "Prepare for more monitor events" patch into three
easier to digest pieces.

3) Simplified the "Improve domain type checking" patch by making
the rdt_resource type its own field in the rdt_domain_hdr structure
instead of encoding it in a bit field combined with the CTRL/MON type.

4) Added "l3" to a bunch of function and structure names to indicate
that they are now specific to L3 events instead of generic monitoring.

5) Struct mon_evt is also the source of truth for "can this event be
read from any CPU?". Other structures (mon_data and rmid_read) now
have pointers to mon_evt instead of their own field copied from
mon_evt.

6) Events that can be read on any CPU now bypass the
cpumask_any_housekeeping() path that would have resulted in an
IPI to the first CPU on a domain. mon_event_read() now directly
calls mon_event_count() for these events.

7) Renamed the per-mount hook and commented on (lack of) locking
by the caller.

8) Split the enumeration of telemetry events into easier to
review chunks with more comments in the code at each stage.

9) Simplified the intel_aet_read_event() code. No funky macros
to pick up parameters for the MMIO address calculation. Added
a sanity check that the computed MMIO register address is in
the range provided by the aggregator.

10) File system now owns the output format. Architecture cannot
make choices. Every event is hard-coded to be displayed as
integer or floating point.

11) Added additional options to the rdt= boot option for the user
to force opt-in or opt-out of telemetry events. Use these options
to solve the "how many RMIDs can be used?" issue.

12) Moved the final calculation of the available number of RMIDs to the
first mount of the resctrl file system and made it determine the smallest
value across all mon_capable resources.

13) Version 2 of the patch series included extra files in the info/
directory to report some internal status values. V3 dropped that
entirely because I couldn't see a good way to cross the fs<->arch
boundary with extra architecture specific info files. Patches
29-30 are an RFC way to bring this back when the file system is
mounted with the "debug" option.
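
For illustration, item 12 amounts to taking a minimum over the
monitoring-capable resources. The sketch below is a user-space model,
not code from the series; struct mon_resource, its fields, and
usable_rmids() are hypothetical names chosen for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the monitoring-related parts of a resource. */
struct mon_resource {
	bool mon_capable;	/* resource supports monitoring */
	uint32_t num_rmid;	/* number of RMIDs this resource can track */
};

/*
 * Return the number of RMIDs usable by every monitoring-capable
 * resource, or 0 if no resource is monitoring capable.
 */
static uint32_t usable_rmids(const struct mon_resource *res, size_t n)
{
	uint32_t limit = UINT32_MAX;
	bool found = false;

	for (size_t i = 0; i < n; i++) {
		if (!res[i].mon_capable)
			continue;
		found = true;
		if (res[i].num_rmid < limit)
			limit = res[i].num_rmid;
	}

	return found ? limit : 0;
}
```

Resources that are not monitoring capable are skipped, so a small
num_rmid on such a resource does not lower the result.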

Background
----------

Telemetry features are being implemented in conjunction with the
IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
counts for various events to a collector in a nearby OOBMSM device to be
accumulated with counts for each <RMID, event> pair received from other
CPUs. Cores send event counts when the RMID value changes, or after
every 2ms of elapsed time.

Each OOBMSM device may implement multiple event collectors with each
servicing a subset of the logical CPUs on a package.  In the initial
hardware implementation, there are two categories of events: energy
and perf.

1) Energy - Two counters
core_energy: This is an estimate of Joules consumed by each core. It is
calculated based on the types of instructions executed, not from a power
meter. This counter is useful to understand how much energy a workload
is consuming.

activity: This measures "accumulated dynamic capacitance". Users who
want to optimize energy consumption for a workload may use this rather
than core_energy because it provides consistent results independent of
any frequency or voltage changes that may occur during the runtime of
the application (e.g. entry/exit from turbo mode).

2) Performance - Seven counters
These events are similar to those available via the Linux "perf" tool,
but are collected with much lower overhead (no need to collect data
on every context switch).

stalls_llc_hit - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which hit in the LLC

c1_res - Counts the total C1 residency across all cores. The underlying
counter increments on 100MHz clock ticks

unhalted_core_cycles - Counts the total number of unhalted core clock
cycles

stalls_llc_miss - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which missed all the
local caches

c6_res - Counts the total C6 residency. The underlying counter increments
on crystal clock (25MHz) ticks

unhalted_ref_cycles - Counts the total number of unhalted reference clock
(TSC) cycles

uops_retired - Counts the total number of uops retired
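
The two residency counters tick at different rates, so their raw values
are not directly comparable. The sketch below converts a tick count to
milliseconds using the clock rates quoted above. It assumes the reader
has a raw tick count in hand; the macro and function names are
hypothetical and do not come from the patches.

```c
#include <stdint.h>

#define C1_RES_HZ	100000000ULL	/* 100MHz, per the c1_res description */
#define C6_RES_HZ	25000000ULL	/* 25MHz crystal clock, per the c6_res description */

/* Convert a raw residency tick count to whole milliseconds (rounds down). */
static uint64_t ticks_to_ms(uint64_t ticks, uint64_t hz)
{
	return ticks / (hz / 1000);
}
```

For example, 100000000 c1_res ticks and 25000000 c6_res ticks both
correspond to one second of residency.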

The counters are arranged in groups in MMIO space of the OOBMSM device.
E.g. for the energy counters the layout is:

Offset: Counter
0x00	core energy for RMID 0
0x08	core activity for RMID 0
0x10	core energy for RMID 1
0x18	core activity for RMID 1
...
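
The layout above implies a simple offset calculation: counters are 8
bytes each and the events of a group are interleaved per RMID. The
sketch below models that arithmetic together with a bounds check
against the size of the MMIO region. It is an illustration only;
counter_offset(), COUNTER_BYTES, and the parameter names are
hypothetical and are not the code in intel_aet.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COUNTER_BYTES	8	/* each counter occupies 8 bytes in the layout above */

/*
 * Compute the byte offset of the counter for <rmid, evt_idx> in a group
 * with num_events counters per RMID (for the energy group: index 0 is
 * core energy, index 1 is activity).
 *
 * Returns false if the event index is invalid or the counter would fall
 * outside a region of region_size bytes.
 */
static bool counter_offset(uint32_t rmid, uint32_t evt_idx,
			   uint32_t num_events, size_t region_size,
			   size_t *offset)
{
	size_t off;

	if (evt_idx >= num_events)
		return false;

	off = ((size_t)rmid * num_events + evt_idx) * COUNTER_BYTES;
	if (off + COUNTER_BYTES > region_size)
		return false;

	*offset = off;
	return true;
}
```

With two events per RMID, <RMID 1, activity> lands at offset 0x18,
matching the table above.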

Enumeration
-----------

The only CPUID based enumeration for this feature is the legacy
CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
IA32_PQR_ASSOC MSR and the RMID field within it.
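
As a small illustration of that check, the helper below tests bit 12 of
the EBX value returned by CPUID leaf 7, subleaf 0. The macro and
function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDT_MONITORING_BIT	12	/* CPUID(eax=7,ecx=0).ebx bit 12 */

/* Return true if the given leaf 7 EBX value advertises RDT monitoring. */
static bool ebx_has_rdt_monitoring(uint32_t ebx)
{
	return (ebx >> RDT_MONITORING_BIT) & 1;
}
```

In user space on x86, the EBX value can be obtained with
__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) from the GCC/Clang
<cpuid.h> header.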

The OOBMSM driver discovers which features are present via
PCIe VSEC capabilities. Each feature is tagged with a unique
identifier. These identifiers indicate which XML description file from
https://github.com/intel/Intel-PMT describes which event counters are
available and their layout within the MMIO BAR space of the OOBMSM device.

Resctrl User Interface
----------------------

Because there may be multiple OOBMSM collection agents per processor
package, resctrl accumulates event counts from all agents on a package
and presents a single value to users. This will provide a consistent
user interface on future platforms that vary the number of collectors,
or the mappings from logical CPUs to collectors.

Users will continue to see the legacy monitoring files in the "L3"
directories and the telemetry files in the new "PERF_PKG" directories
(with each file providing the aggregated value from all OOBMSM collectors
on that package).

$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_PERF_PKG_00
│   ├── activity
│   ├── c1_res
│   ├── c6_res
│   ├── core_energy
│   ├── stalls_llc_hit
│   ├── stalls_llc_miss
│   ├── unhalted_core_cycles
│   ├── unhalted_ref_cycles
│   └── uops_retired
└── mon_PERF_PKG_01
    ├── activity
    ├── c1_res
    ├── c6_res
    ├── core_energy
    ├── stalls_llc_hit
    ├── stalls_llc_miss
    ├── unhalted_core_cycles
    ├── unhalted_ref_cycles
    └── uops_retired
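
Each file under mon_PERF_PKG_* therefore reports one value per package.
A minimal model of that aggregation, assuming the per-collector counts
for one <RMID, event> pair have already been read into an array
(package_event_total() and its parameters are hypothetical names):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the per-collector counts for a single <RMID, event> pair.
 * counts[i] is the value read from collector i on the package.
 */
static uint64_t package_event_total(const uint64_t *counts,
				    size_t num_collectors)
{
	uint64_t total = 0;

	for (size_t i = 0; i < num_collectors; i++)
		total += counts[i];

	return total;
}
```

Because the sum is taken inside resctrl, the reported value does not
depend on how many collectors a platform has.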

Resctrl Implementation
----------------------

The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
that returns an array of structures describing the per-RMID groups it
found from the VSEC enumeration. Linux looks at the unique identifiers
for each group and enables resctrl for all groups with known unique
identifiers.

The memory map for the counters for each <RMID, event> pair is described
by the XML file. This is too unwieldy to use in the Linux kernel, so a
simplified representation is built into the resctrl code. Note that the
counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
and IA32_QM_CTR MSRs. This means there is no need for cross-processor
calls to read counters from a CPU in a specific domain. The counters
can be read from any CPU.
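
The effect on the read path can be modelled as a choice of which CPU
performs the read. The sketch below is a simplified user-space model
with a 64-bit mask standing in for a cpumask; struct event_desc,
pick_read_cpu(), and their fields are hypothetical and do not mirror
the kernel's data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-event description. */
struct event_desc {
	bool any_cpu;	/* event can be read from any CPU (MMIO counters) */
};

/*
 * Pick the CPU on which to read an event. cur_cpu must be in 0..63.
 *
 * Returns cur_cpu when no cross-CPU call is needed: either the event is
 * readable anywhere, or the current CPU already belongs to the domain.
 * Otherwise returns the lowest-numbered CPU in domain_mask, or -1 if
 * the domain has no CPUs.
 */
static int pick_read_cpu(const struct event_desc *evt, int cur_cpu,
			 uint64_t domain_mask)
{
	if (evt->any_cpu || (domain_mask & (1ULL << cur_cpu)))
		return cur_cpu;

	for (int cpu = 0; cpu < 64; cpu++) {
		if (domain_mask & (1ULL << cpu))
			return cpu;
	}

	return -1;
}
```

For an MSR-based L3 event read from a CPU outside the domain, the model
returns a CPU inside the domain, which corresponds to the cross-processor
call. For a telemetry event it always returns the current CPU.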

High level description of code changes:

1) New scope RESCTRL_PACKAGE
2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
3) Refactor monitor code paths to split existing L3 paths from new ones.
In some cases this ends up with:
        switch (r->rid) {
        case RDT_RESOURCE_L3:
                /* helper for L3 */
                break;
        case RDT_RESOURCE_PERF_PKG:
                /* helper for PKG */
                break;
        }
4) New source code file "intel_aet.c" for the code to enumerate,
configure, and report event counts.

With only one platform providing this feature, it's tricky to tell
exactly where it is going to go. I've made the event definitions
platform specific (based on the unique ID from the VSEC enumeration). It
seems possible/likely that the list of events may change from generation
to generation.

I've picked names for events based on the descriptions in the XML file.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (31):
  x86,fs/resctrl: Drop rdt_mon_features variable
  x86,fs/resctrl: Prepare for more monitor events
  fs/resctrl: Clean up rdtgroup_mba_mbps_event_{show,write}()
  fs/resctrl: Change how and when events are initialized
  fs/resctrl: Set up Kconfig options for telemetry events
  x86/rectrl: Fake OOBMSM interface
  x86,fs/resctrl: Improve domain type checking
  x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
  x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain
    types
  x86/resctrl: Change generic monitor functions to use struct
    rdt_domain_hdr
  x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
  fs/resctrl: Improve handling for events that can be read from any CPU
  fs/resctrl: Add support for additional monitor event display formats
  fs/resctrl: Add an architectural hook called for each mount
  x86/resctrl: Add and initialize rdt_resource for package scope core
    monitor
  x86/resctrl: Add first part of telemetry event enumeration
  x86/resctrl: Add second part of telemetry event enumeration
  x86/resctrl: Add third part of telemetry event enumeration
  x86,fs/resctrl: Fill in details of Clearwater Forest events
  x86/resctrl: Check for adequate MMIO space
  x86/resctrl: Add fourth part of telemetry event enumeration
  x86/resctrl: Read core telemetry events
  x86,fs/resctrl: Handle domain creation/deletion for
    RDT_RESOURCE_PERF_PKG
  fs/resctrl: Add type define for PERF_PKG files
  x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG
  x86/resctrl: Add energy/perf choices to rdt boot option
  x86/resctrl: Handle number of RMIDs supported by telemetry resources
  x86,fs/resctrl: Fix RMID allocation for multiple monitor resources
  fs/resctrl: Add interface for per-resource debug info files
  x86/resctrl: Add info/PERF_PKG_MON/status file
  x86/resctrl: Update Documentation for package events

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/filesystems/resctrl.rst         |  53 ++-
 include/linux/resctrl.h                       |  51 ++-
 include/linux/resctrl_types.h                 |  19 +
 arch/x86/include/asm/resctrl.h                |  16 -
 .../cpu/resctrl/fake_intel_aet_features.h     |  73 ++++
 arch/x86/kernel/cpu/resctrl/internal.h        |  35 +-
 fs/resctrl/internal.h                         |  42 ++-
 arch/x86/kernel/cpu/resctrl/core.c            | 273 ++++++++++----
 .../cpu/resctrl/fake_intel_aet_features.c     |  95 +++++
 arch/x86/kernel/cpu/resctrl/intel_aet.c       | 343 ++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c         |  61 ++--
 fs/resctrl/ctrlmondata.c                      |  93 ++---
 fs/resctrl/monitor.c                          | 269 +++++++++-----
 fs/resctrl/rdtgroup.c                         | 221 +++++++----
 arch/x86/Kconfig                              |   1 +
 arch/x86/kernel/cpu/resctrl/Makefile          |   2 +
 drivers/platform/x86/intel/pmt/Kconfig        |   7 +
 18 files changed, 1283 insertions(+), 373 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
 create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c


base-repository: git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git
base-branch: mpam/move_to_fs/v9_final
base-commit: dc979ecda2982f7c09de81cde1ec902fdc8e202f
-- 
2.48.1

