public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Reinette Chatre <reinette.chatre@intel.com>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Fenghua Yu <fenghuay@nvidia.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	"Drew Fustini" <dfustini@baylibre.com>,
	Dave Martin <Dave.Martin@arm.com>,
	"Anil Keshavamurthy" <anil.s.keshavamurthy@intel.com>,
	<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Date: Tue, 22 Apr 2025 14:30:32 -0700	[thread overview]
Message-ID: <8e9e7e36-1c01-433f-a0ed-7bcb856affdd@intel.com> (raw)
In-Reply-To: <aAfB29E8Shqp8JY3@agluck-desk3>

Hi Tony,

On 4/22/25 9:20 AM, Luck, Tony wrote:
> On Mon, Apr 21, 2025 at 03:59:15PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 4/21/25 11:57 AM, Luck, Tony wrote:
>>> On Fri, Apr 18, 2025 at 02:13:39PM -0700, Reinette Chatre wrote:
>>>> One aspect that is only hinted to in the final documentation patch is
>>>> how users are expected to use this feature. As I understand the number of
>>>> monitor groups supported by resctrl is still guided by the number of RMIDs
>>>> supported by L3 monitoring. This work hints that the telemetry feature may
>>>> not match that number of RMIDs and a monitor group may thus exist but
>>>> when a user attempts to ready any of these perf files it will return
>>>> "unavailable".
>>>>
>>>> The series attempts to address it by placing the number of RMIDs available
>>>> for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
>>>> group is not exposed to user space (unless debugging enabled) the user does
>>>> not know if a monitor group will support this feature or not. This seems awkward
>>>> to me. Why not limit the number of monitor groups that can be created to the
>>>> minimum number of RMIDs across these resources like what is done for CLOSid?
>>>
>>> Reinette,
>>>
>>> The mismatch between number of RMIDs supported by different components
>>> is a thorny one, and may keep repeating since it feels like systems are
>>> composed of a bunch of lego-like bricks snapped together from a box of
>>> parts available to the h/w architect.
>>
>> With resctrl needing to support multiple architectures' way of doing things,
>> needing to support variety within an architecture just seems like another step.
>>
>>>
>>> In this case we have three meanings for "number of RMIDs":
>>>
>>> 1) The number for legacy features enumerated by CPUID leaf 0xF.
>>>
>>> 2) The number of registers in MMIO space for each event. This is
>>> enumerated in the XML files and is the value I placed into telem_entry::num_rmids.
>>>
>>> 3) The number of "h/w counters" (this isn't a strictly accurate
>>> description of how things work, but serves as a useful analogy that
>>> does describe the limitations) feeding to those MMIO registers. This is
>>> enumerated in telemetry_region::num_rmids returned from the call to
>>> intel_pmt_get_regions_by_feature()
>>
>> Thank you for explaining this. This was not clear to me.
>>
>>>
>>> If "1" is the smallest of these values, the OS will be limited in
>>> which values can be written to the IA32_PQR_ASSOC MSR. Existing
>>> code will do the right thing by limiting RMID allocation to this
>>> value.
>>>
>>> If "2" is greater than "1", then the extra MMIO registers will
>>> sit unused.
>>
>> This is also an issue with this implementation, no? resctrl will not
>> allow creating more monitor groups than "1".
> 
> On Intel there is no point in creating more groups than "1" allows.
> You can't make use of any RMID above that limit because you will get
> a #GP fault trying to write to the IA32_PQR_ASSOC MSR.
> 
> You could read the extra MMIO registers provided by "2", but they
> will always be zero since no execution occurred with an RMID in the
> range "1" ... "2".
> 
> The "2" is greater than "1" may be relatively common since the h/w
> for the telemetry counters is common for SKUs with different numbers
> of cores, and thus different values of "1". So low core count
> systems will see more telemetry counters than they can actually
> make use of. I will make sure not to print a message for this case.

I see, thank you.

> 
>>> If "2" is less than "1" my v3 returns the (problematic) -ENOENT
>>> This can't happen in the CPU that debuts this feature, but the check
>>> is there to prevent running past the end of the MMIO space in case
>>> this does occur some day. I'll fix error path in next version to
>>> make sure this end up with "Unavailable".
>>
>> This is a concern since this means the interface becomes a "try and see"
>> for user space. As I understand a later statement the idea is that
>> "2" should be used by user space to know how many "mon_groups" directories
>> should be created to get telemetry support. To me this looks to be
>> a space that will create a lot of confusion. The moment user space
>> creates "2" + 1 "mon_groups" directories it becomes a guessing game
>> of what any new monitor group actually supports. After crossing that
>> threshold I do not see a good way for going back since if user space
>> removes one "mon_data" directory it does get back to "2" but then needs to
>> rely on resctrl internals or debugging to know for sure what the new
>> monitor group supports.
> 
> But I assert that it is a "can't happen" concern. "2" will be >= "1".
> See below. I will look at addressing this, unless it gets crazy complex
> because of the different enumeration timeline. Delaying calculation of
> number of RMIDs until rdt_get_tree() as you have suggested may be the
> right thing to do.

As you explain it, it does not sound as though calculating how many
RMIDs can be supported is required to be done in rdt_get_tree(), but doing
so would make the implementation more robust since doing so does not rely
on assumptions about what hardware can and will support. 

> 
> "3" is the real problem
> 
>>>
>>> If "3" is less than "2" then the system will attach "h/w counters" to
>>> MMIO registers in a "most recently used" algorithm. So if the number
>>> of active RMIDs in some time interval is less than "3" the user will
>>> get good values. But if the number of active RMIDs rises above "3"
>>> then the user will see "Unavailable" returns as "h/w counters" are
>>> reassigned to different RMIDs (making the feature really hard to use).
>>
>> Could the next step be for the architecture to allow user space to
>> specify which hardware counters need to be assigned? With a new user
>> interface being created for such capability it may be worthwhile to
>> consider how it could be used/adapted for this feature. [1]
>>
>>>
>>> In the debut CPU the "energy" feature has sufficient "energy" counters
>>> to avoid this. But not enough "perf" counters. I've pushed and the
>>> next CPU with the feature will have enough "h/w counters".
>>>
>>> My proposal for v4:
>>>
>>> Add new options to the "rdt=" kernel boot parameter for "energy"
>>> and "perf".
>>>
>>> Treat the case where there are not enough "h/w counters" as an erratum
>>> and do not enable the feature. User can override with "rdt=perf"
>>> if they want the counters for some special case where they limit
>>> the number of simultaneous active RMIDs.

I get this now. This will require rework of the kernel command line parsing
support since current implementation is so closely integrated with the
X86_FEATURE_* flags (and is perhaps an unexpected architecture specific
portion of resctrl). 
What if "rdt=perf" means that "3" is also included in the computation
of how many monitor groups are supported? That would help users to not
need to limit the number of simultaneous active RMIDs.

>>
>> This only seems to address the "3" is less than "2" issue. It is not
>> so obvious to me that it should be treated as an erratum. Although,
>> I could not tell from your description how obvious this issue will be
>> to user space. For example, is it clear that if user space
>> gets *any* value then it is "good" and "Unavailable" means ... "Unavailable", or
>> could a returned value mean "this is partial data that was collected
>> during timeframe with hardware counter re-assigned at some point"?
> 
> When running jobs with more distinct RMIDs than "3" users are at the
> mercy of the h/w replacement algorithm. Resctrl use cases for monitoring
> are all "read an event counter; wait for some time; re-read the event
> counter; compute the rate". With "h/w counter" reassignment the second
> read may get "Unavailable", or worse the "h/w counter" may have been
> taken, and the returned so a value will be provided to the user, but
> it won't provide the count of events since the first read.
> 
> That's why I consider this an erratum. There's just false hope that
> you can get a pair of meaningful event counts and no sure indication
> that you didn't get garbage.
> 
>>>
>>> User can use "rdt=!energy,!perf" if they don't want to see the
>>> clutter of all the new files in each mon_data directory.
>>>
>>> I'll maybe look at moving resctrl_mon_resource_init() to rdt_get_tree()
>>> and add a "take min of all RMID limits". But since this is a "can't
>>> happen" scenario I may skip this if it starts to get complicated.
>>
>> I do not think that the "2" is less than "1" scenario should be 
>> ignored for reasons stated above and in review of this version.
>>
>> What if we enhance resctrl's RMID assignment (setting aside for
>> a moment PMG assignment) to be directed by user space?
> 
> I'll take a look at reducing user reported num_rmids to the minimum
> of the "1" and "2" values.

When comparing to "num_closids" the expectation may be that "num_rmids"
would be accurate for particular resource with understanding that the
minimum among all resources guides the number of monitor groups. This
seems close enough to existing interface to not use this as moment
to move to a new "num_mon_hw_id" or such that works for MPAM also.

> 
>> Below is an idea of an interface that can give user space 
>> control over what monitor groups are monitoring. This is very likely not
>> the ideal interface but I would like to present it as a start for
>> better ideas.
>>
>> For example, monitor groups are by default created with most abundant
>> (and thus supporting fewest features on fewest resources) RMID.
>> The user is then presented with a new file (within each monitor group)
>> that lists all available features and which one(s) are active. For example,
>> let's consider hypothetical example where PERF_PKG perf has x RMID, PERF_PKG energy
>> has y RMID, and L3_MON has z RMID, with x < y < z. By default when user space
>> creates a monitor group resctrl will pick "abundant" RMID from range y + 1 to z
>> that only supports L3 monitoring:
> 
> There is no way for s/w to control the reallocation of "h/w counters"
> when "3" is too small. So there is no set of RMIDs that support many
> events vs. fewer events. AMD is solving this similar problem with their
> scheme to pin h/w counters to specific RMIDs. I discussed such an option
> for the "3" case, but it wasn't practical to apply to the upcoming CPU
> that has this problem. The long term solution is to ensure that "3" is
> always large enough that all RMIDs have equal monitoring capabilities.

ack.

Reinette

  reply	other threads:[~2025-04-22 21:30 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-07 23:40 [PATCH v3 00/26] x86/resctrl telemetry monitoring Tony Luck
2025-04-07 23:40 ` [PATCH v3 01/26] fs/resctrl: Simplify allocation of mon_data structures Tony Luck
2025-04-18 21:13   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 02/26] fs-x86/resctrl: Prepare for more monitor events Tony Luck
2025-04-18 21:17   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 03/26] fs/resctrl: Change how events are initialized Tony Luck
2025-04-18 21:22   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 04/26] fs/resctrl: Set up Kconfig options for telemetry events Tony Luck
2025-04-18 21:23   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 05/26] x86/rectrl: Fake OOBMSM interface Tony Luck
2025-04-18 21:27   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 06/26] fs-x86/rectrl: Improve domain type checking Tony Luck
2025-04-18 21:40   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 07/26] x86/resctrl: Move L3 initialization out of domain_add_cpu_mon() Tony Luck
2025-04-18 21:51   ` Reinette Chatre
2025-04-21 20:01     ` Luck, Tony
2025-04-22 18:18       ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 08/26] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
2025-04-18 21:53   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 09/26] x86/resctrl: Change generic monitor functions to use struct rdt_domain_hdr Tony Luck
2025-04-18 22:42   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU Tony Luck
2025-04-18 22:54   ` Reinette Chatre
2025-04-21 20:28     ` Luck, Tony
2025-04-22 18:19       ` Reinette Chatre
2025-04-23  0:51         ` Luck, Tony
2025-04-23  3:37           ` Reinette Chatre
2025-04-23 13:27         ` Peter Newman
2025-04-23 15:47           ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 11/26] fs/resctrl: Add support for additional monitor event display formats Tony Luck
2025-04-18 23:02   ` Reinette Chatre
2025-04-21 19:34     ` Luck, Tony
2025-04-22 18:20       ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 12/26] fs/resctrl: Add hook for architecture code to set monitor event attributes Tony Luck
2025-04-18 23:11   ` Reinette Chatre
2025-04-21 19:50     ` Luck, Tony
2025-04-22 18:20       ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 13/26] fs/resctrl: Add an architectural hook called for each mount Tony Luck
2025-04-18 23:47   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 14/26] x86/resctrl: Add first part of telemetry event enumeration Tony Luck
2025-04-19  0:08   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 15/26] x86/resctrl: Second stage " Tony Luck
2025-04-19  0:30   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 16/26] x86/resctrl: Third phase " Tony Luck
2025-04-19  0:45   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 17/26] x86/resctrl: Build a lookup table for each resctrl event id Tony Luck
2025-04-19  0:48   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 18/26] x86/resctrl: Add code to read core telemetry events Tony Luck
2025-04-19  1:53   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 19/26] x86/resctrl: Sanity check telemetry RMID values Tony Luck
2025-04-19  5:14   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 20/26] x86/resctrl: Add and initialize rdt_resource for package scope core monitor Tony Luck
2025-04-07 23:40 ` [PATCH v3 21/26] fs-x86/resctrl: Handle RDT_RESOURCE_PERF_PKG in domain create/delete Tony Luck
2025-04-19  5:22   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 22/26] fs/resctrl: Add type define for PERF_PKG files Tony Luck
2025-04-07 23:40 ` [PATCH v3 23/26] fs/resctrl: Add new telemetry event id and structures Tony Luck
2025-04-07 23:40 ` [PATCH v3 24/26] x86/resctrl: Final steps to enable RDT_RESOURCE_PERF_PKG Tony Luck
2025-04-07 23:40 ` [PATCH v3 25/26] fs-x86/resctrl: Add detailed descriptions for Clearwater Forest events Tony Luck
2025-04-19  5:30   ` Reinette Chatre
2025-04-07 23:40 ` [PATCH v3 26/26] x86/resctrl: Update Documentation for package events Tony Luck
2025-04-19  5:40   ` Reinette Chatre
2025-04-18 21:13 ` [PATCH v3 00/26] x86/resctrl telemetry monitoring Reinette Chatre
2025-04-21 18:57   ` Luck, Tony
2025-04-21 22:59     ` Reinette Chatre
2025-04-22 16:20       ` Luck, Tony
2025-04-22 21:30         ` Reinette Chatre [this message]
2025-04-19  5:47 ` Reinette Chatre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8e9e7e36-1c01-433f-a0ed-7bcb856affdd@intel.com \
    --to=reinette.chatre@intel.com \
    --cc=Dave.Martin@arm.com \
    --cc=anil.s.keshavamurthy@intel.com \
    --cc=babu.moger@amd.com \
    --cc=dfustini@baylibre.com \
    --cc=fenghuay@nvidia.com \
    --cc=james.morse@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maciej.wieczor-retman@intel.com \
    --cc=patches@lists.linux.dev \
    --cc=peternewman@google.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox