public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Reinette Chatre <reinette.chatre@intel.com>
To: Tony Luck <tony.luck@intel.com>, Fenghua Yu <fenghuay@nvidia.com>,
	"Maciej Wieczor-Retman" <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	Drew Fustini <dfustini@baylibre.com>,
	Dave Martin <Dave.Martin@arm.com>,
	Anil Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: <linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values
Date: Fri, 18 Apr 2025 22:14:33 -0700	[thread overview]
Message-ID: <fba9db1e-4840-432e-87ff-12819381ff41@intel.com> (raw)
In-Reply-To: <20250407234032.241215-20-tony.luck@intel.com>

Hi Tony,

On 4/7/25 4:40 PM, Tony Luck wrote:
> There are three values of interest:
> 1) The number of RMIDs supported by the CPU core. This is enumerated by
>    CPUID leaf 0xF. Linux saves the value in boot_cpu_data.x86_cache_max_rmid.
> 2) The number of counter registers in each telemetry region. This is
>    described in the XML file for the region. Linux hard codes it into
>    the struct telem_entry..num_rmids field.

Syntax telem_entry::num_rmids can be used for a member.

> 3) The maximum number of RMIDs that can be tracked simultaneously for
>    a telemetry region. This is provided in the structures received from
>    the intel_pmt_get_regions_by_feature() calls.

Is (2) and (3) not required to be the same? If not, how does resctrl know
which counter/RMID is being tracked?

> 
> Print appropriate warnings if these values do not match.

As mentioned in cover letter I do not think that just printing a warning
is sufficient. It really becomes a trial-and-error guessing game for user
space to know which monitor group supports telemetry events.

> 
> TODO: Need a better UI. The number of implemented counters can be
> different per telemetry region.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/intel_aet.c | 31 +++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 67a1245858dc..0bcbac326bee 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -13,6 +13,7 @@
>  
>  #include <linux/cpu.h>
>  #include <linux/cleanup.h>
> +#include <linux/minmax.h>

Please sort includes alphabetically.

>  #include "fake_intel_aet_features.h"
>  #include <linux/intel_vsec.h>
>  #include <linux/resctrl.h>
> @@ -51,6 +52,7 @@ struct pmt_event {
>   * @last_overflow_tstamp_off:	Offset of overflow timestamp
>   * @last_update_tstamp_off:	Offset of last update timestamp
>   * @active:			Marks this group as active on this system
> + * @rmid_warned:		Set to stop multiple rmid sanity warnings

rmid -> RMID. 

I find the description unclear on how to interact with this member. How about
something like:
	True if user space have been warned about number of RMIDs used by
	different resources not matching.

>   * @num_events:			Size of @evts array
>   * @evts:			Telemetry events in this group
>   */
> @@ -63,6 +65,7 @@ struct telem_entry {
>  	int	last_overflow_tstamp_off;
>  	int	last_update_tstamp_off;
>  	bool	active;
> +	bool	rmid_warned;
>  	int	num_events;
>  	struct pmt_event evts[];
>  };
> @@ -84,6 +87,33 @@ static struct telem_entry *telem_entry[] = {
>  	NULL
>  };
>  
> +static void rmid_sanity_check(struct telemetry_region *tr, struct telem_entry *tentry)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> +	int system_rmids = boot_cpu_data.x86_cache_max_rmid + 1;

It is not clear what "system_rmids" should represent here. Is it, as changelog states,
maximum supported by CPU core, or is it maximum supported by L3 resource, which is the
maximum number of monitor groups that can be created.

We see in rdt_get_mon_l3_config() that:
	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;

This makes me wonder how this feature behaves on SNC systems?

> +
> +	if (tentry->rmid_warned)
> +		return;
> +
> +	if (tentry->num_rmids != system_rmids) {
> +		pr_info("Telemetry region %s has %d RMIDs system supports %d\n",

Is pr_info() intended to be pr_warn()?
The message self could do with a comma?

> +			tentry->name, tentry->num_rmids, system_rmids);
> +		tentry->rmid_warned = true;
> +	}

Could you please add comments about consequences of when this is encountered?

> +
> +	if (tr->num_rmids < tentry->num_rmids) {
> +		pr_info("Telemetry region %s only supports %d simultaneous RMIDS\n",
> +			tentry->name, tr->num_rmids);
> +		tentry->rmid_warned = true;
> +	}

I am still trying to get used to all the data structures. From what I can tell, the
offset of counter is obtained from struct telem_entry. If struct telem_entry thus
thinks there are more RMIDs than what the region supports, would this not cause
memory reads to exceed what region supports?

Could you please add comments about consequences of when this is encountered?

> +
> +	/* info/PKG_PERF_MON/num_rmids reports number of guaranteed counters */
> +	if (!r->num_rmid)
> +		r->num_rmid = tr->num_rmids;
> +	else
> +		r->num_rmid = min((u32)r->num_rmid, tr->num_rmids);
> +}

As I mentioned in response to previous version it may be possible to move
resctrl_mon_resource_init() to rdt_get_tree() to be done after these RMID
counts are discovered. When doing so it is possible to size the available
RMIDs used on system to be supported by all resources.

> +
>  /*
>   * Scan a feature group looking for guids recognized
>   * and update the per-package counts of known groups.
> @@ -109,6 +139,7 @@ static bool count_events(struct pkg_info *pkg, int max_pkgs, struct pmt_feature_
>  					pr_warn_once("MMIO region for guid 0x%x too small\n", tr->guid);
>  					continue;
>  				}
> +				rmid_sanity_check(tr, *tentry);
>  				found = true;
>  				(*tentry)->active = true;
>  				pkg[tr->plat_info.package_id].count++;

Reinette

  reply	other threads:[~2025-04-19  5:14 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
2025-04-07 23:40 ` [PATCH v3 01/26] fs/resctrl: Simplify allocation of mon_data structures Tony Luck
2025-04-18 21:13   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 02/26] fs-x86/resctrl: Prepare for more monitor events Tony Luck
2025-04-18 21:17   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 03/26] fs/resctrl: Change how events are initialized Tony Luck
2025-04-18 21:22   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 04/26] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
2025-04-18 21:23   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 05/26] x86/rectrl: Fake OOBMSM interface Tony Luck
2025-04-18 21:27   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 06/26] fs-x86/rectrl: Improve domain type checking Tony Luck
2025-04-18 21:40   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
2025-04-18 21:51   ` Reinette Chatre
2025-04-21 20:01     ` Luck, Tony
2025-04-22 18:18       ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 08/26] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
2025-04-18 21:53   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 09/26] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
2025-04-18 22:42   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
2025-04-18 22:54   ` Reinette Chatre
2025-04-21 20:28     ` Luck, Tony
2025-04-22 18:19       ` Reinette Chatre
2025-04-23  0:51         ` Luck, Tony
2025-04-23  3:37           ` Reinette Chatre
2025-04-23 13:27         ` Peter Newman
2025-04-23 15:47           ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats Tony Luck
2025-04-18 23:02   ` Reinette Chatre
2025-04-21 19:34     ` Luck, Tony
2025-04-22 18:20       ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes Tony Luck
2025-04-18 23:11   ` Reinette Chatre
2025-04-21 19:50     ` Luck, Tony
2025-04-22 18:20       ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 13/26] fs/resctrl: Add an architectural hook called for each mount Tony Luck
2025-04-18 23:47   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 14/26] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
2025-04-19  0:08   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 15/26] x86/resctrl: Second stage " Tony Luck
2025-04-19  0:30   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 16/26] x86/resctrl: Third phase " Tony Luck
2025-04-19  0:45   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 17/26] x86/resctrl: Build a lookup table for each resctrl event id Tony Luck
2025-04-19  0:48   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 18/26] x86/resctrl: Add code to read core telemetry events Tony Luck
2025-04-19  1:53   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values Tony Luck
2025-04-19  5:14   ` Reinette Chatre [this message]
2025-04-07 23:40 ` [PATCH v3 20/26] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
2025-04-07 23:40 ` [PATCH v3 21/26] fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete Tony Luck
2025-04-19  5:22   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 22/26] fs/resctrl: Add type define for PERF_PKG files Tony Luck
2025-04-07 23:40 ` [PATCH v3 23/26] fs/resctrl: Add new telemetry event id and structures Tony Luck
2025-04-07 23:40 ` [PATCH v3 24/26] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG Tony Luck
2025-04-07 23:40 ` [PATCH v3 25/26] fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events Tony Luck
2025-04-19  5:30   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 26/26] x86/resctrl: Update Documentation for package events Tony Luck
2025-04-19  5:40   ` Reinette Chatre
2025-04-18 21:13 ` [PATCH v3 00/26] x86/resctrl telemetry monitoring Reinette Chatre
2025-04-21 18:57   ` Luck, Tony
2025-04-21 22:59     ` Reinette Chatre
2025-04-22 16:20       ` Luck, Tony
2025-04-22 21:30         ` Reinette Chatre
2025-04-19  5:47 ` Reinette Chatre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fba9db1e-4840-432e-87ff-12819381ff41@intel.com \
    --to=reinette.chatre@intel.com \
    --cc=Dave.Martin@arm.com \
    --cc=anil.s.keshavamurthy@intel.com \
    --cc=babu.moger@amd.com \
    --cc=dfustini@baylibre.com \
    --cc=fenghuay@nvidia.com \
    --cc=james.morse@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maciej.wieczor-retman@intel.com \
    --cc=patches@lists.linux.dev \
    --cc=peternewman@google.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox