From: Reinette Chatre <reinette.chatre@intel.com>
To: Dave Martin <Dave.Martin@arm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>,
<linux-kernel@vger.kernel.org>,
"James Morse" <james.morse@arm.com>,
Thomas Gleixner <tglx@linutronix.de>,
"Ingo Molnar" <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
"Jonathan Corbet" <corbet@lwn.net>, <x86@kernel.org>,
<linux-doc@vger.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
Date: Fri, 17 Oct 2025 08:59:45 -0700
Message-ID: <e788ca62-ec63-4552-978b-9569f369afd5@intel.com>
In-Reply-To: <aPJP52jXJvRYAjjV@e133380.arm.com>

Hi Dave,

On 10/17/25 7:17 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/15/25 8:47 AM, Dave Martin wrote:
>>> Hi Reinette,
>>>
>>> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 10/13/25 7:36 AM, Dave Martin wrote:
...
>>>>> So long as the entries affecting a single resource are ordered so that
>>>>> each entry is strictly more specific than the previous entries (as
>>>>> illustrated above), then reading schemata and stripping all the hashes
>>>>> would allow a previous configuration to be restored; to change just one
>>>>> entry, userspace can uncomment just that one, or write only that entry
>>>>> (which is what I think we should recommend for new software).
>>>>
>>>> This is a good rule of thumb.
>>>
>>> To avoid printing entries in the wrong order, do we want to track some
>>> parent/child relationship between schemata?
>>>
>>> In the above example,
>>>
>>> * MB is the parent of MB_HW;
>>>
>>> * MB_HW is the parent of MB_MIN and MB_MAX.
>>>
>>> (for MPAM, at least).
>>
>> Could you please elaborate on this relationship? I envisioned the MB_HW to be
>> something similar to Intel RDT's "optimal" bandwidth setting ... something
>> that is expected to be somewhere between the "min" and the "max".
>>
>> But, now I think I'm a bit lost in MPAM since it is not clear to me what
>> MB_HW represents ... would this be the "memory bandwidth portion
>> partitioning"? Although, that uses a completely different format from
>> "min" and "max".
>
> I confess that I'm thinking with an MPAM mindset here.
>
> Some pseudocode might help to illustrate how these might interact:
>
> set_MB(partid, val) {
>     set_MB_HW(partid, percent_to_hw_val(val));
> }
>
> set_MB_HW(partid, val) {
>     set_MB_MAX(partid, val);
>
>     /*
>      * Hysteresis to avoid steady flows from ping-ponging
>      * between low and high priority:
>      */
>     if (hardware_has_MB_MIN())
>         set_MB_MIN(partid, val * 95%);
> }
>
> set_MB_MIN(partid, val) {
>     mpam->MBW_MIN[partid] = val;
> }
>
> set_MB_MAX(partid, val) {
>     mpam->MBW_MAX[partid] = val;
> }
>
> with
>
> get_MB(partid) {
>     return hw_val_to_percent(get_MB_HW(partid));
> }
>
> get_MB_HW(partid) { return get_MB_MAX(partid); }
>
> get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; }
>
> get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
>
>
> The parent/child relationship I suggested is basically the call-graph
> of this pseudocode. These could all be exposed as resctrl schemata,
> but the children provide finer / more broken-down control than the
> parents. Reading a parent provides a merged or approximated view of
> the configuration of the child schemata.
>
> In particular,
>
> set_child(partid, get_child(partid));
> get_parent(partid);
>
> yields the same result as
>
> get_parent(partid);
>
> but will not be true in general, if the roles of parent and child are
> reversed.
>
> I think this still holds true if implementing an "MB_HW" schema for
> newer revisions of RDT. The pseudocode would be different, but there
> will still be a tree-like call graph (?)

Thank you very much for the example. I missed in earlier examples that
MB_HW was being controlled via MB_MAX and MB_MIN.

I do not expect such a dependence or tree-like call graph for RDT where
the closest equivalent (termed "optimal") is programmed independently from
min and max.
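
For reference, the quoted pseudocode can be rendered as compilable C to make
the parent/child asymmetry concrete. This is only a sketch under stated
assumptions: the 8-bit hardware scale (HW_VAL_MAX), the struct mpam_state
stand-in for the registers, and the rounding in the two conversion helpers
are invented for illustration and are not taken from the MPAM driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PARTIDS 4
#define HW_VAL_MAX  255u  /* assumed 8-bit hardware scale, illustration only */

/* Stand-in for the MPAM controls named in the pseudocode (hypothetical). */
struct mpam_state {
    uint32_t mbw_min[NUM_PARTIDS];
    uint32_t mbw_max[NUM_PARTIDS];
    bool has_mbw_min;
};

static struct mpam_state mpam = { .has_mbw_min = true };

/* Round-to-nearest conversions between percent and the assumed scale. */
uint32_t percent_to_hw_val(uint32_t percent)
{
    assert(percent <= 100);
    return (percent * HW_VAL_MAX + 50) / 100;
}

uint32_t hw_val_to_percent(uint32_t val)
{
    assert(val <= HW_VAL_MAX);
    return (val * 100 + HW_VAL_MAX / 2) / HW_VAL_MAX;
}

/* Leaf controls: direct access, no bounds checking for brevity. */
void set_mb_min(unsigned int partid, uint32_t val) { mpam.mbw_min[partid] = val; }
void set_mb_max(unsigned int partid, uint32_t val) { mpam.mbw_max[partid] = val; }
uint32_t get_mb_min(unsigned int partid) { return mpam.mbw_min[partid]; }
uint32_t get_mb_max(unsigned int partid) { return mpam.mbw_max[partid]; }

/* MB_HW: parent of MB_MIN and MB_MAX. */
void set_mb_hw(unsigned int partid, uint32_t val)
{
    set_mb_max(partid, val);
    /* Hysteresis: keep MIN at 95% of MAX, as in the pseudocode. */
    if (mpam.has_mbw_min)
        set_mb_min(partid, val * 95 / 100);
}

uint32_t get_mb_hw(unsigned int partid) { return get_mb_max(partid); }

/* MB: parent of MB_HW, expressed in percent. */
void set_mb(unsigned int partid, uint32_t percent)
{
    set_mb_hw(partid, percent_to_hw_val(percent));
}

uint32_t get_mb(unsigned int partid)
{
    return hw_val_to_percent(get_mb_hw(partid));
}
```

With this sketch, writing a child back to itself (set_mb_min(p, get_mb_min(p)))
leaves get_mb(p) unchanged, whereas writing a parent back to itself
(set_mb_hw(p, get_mb_hw(p))) resets MB_MIN to 95% of MB_MAX and so can
discard an independently programmed minimum.
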
>
>
> Going back to MPAM:
>
> Re MPAM memory bandwidth portion partitioning (a.k.a., MBW_PART or
> MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory
> bandwidth is split into discrete, non-overlapping chunks, and each
> PARTID is configured with a bitmap saying which chunks it can use.
> This could be done by time-slicing, or controlling which memory
> controllers/ports a PARTID can issue requests to, or something like
> that.
>
> If the MBW_MAX control isn't implemented, then the current MPAM driver
> maps this bitmap control onto the resctrl "MB" schema in a simple way,
> but we are considering dropping this, since the allocation model
> (explicit, static allocation of discrete resources) is not really the
> same as for RDT MBA (dynamic prioritisation based on recent resource
> consumption).
>
> Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend
> on an equal footing for memory bandwidth until one exceeds 50% (when it
> will start to be penalised). Programming bitmaps can't have the same
> effect. For example, with { 1100, 0110, 0011, 1001 }, no group can use
> more than 50% of the full bandwidth, whatever happens. Worse, certain
> pairs of groups are fully isolated from each other, while others are
> always in contention, no matter how little actual traffic is generated.
> This is potentially useful, but it's not the same as the MIN/MAX model.
>
> So, it may make more sense to expose this as a separate, bitmap schema.
>
> (The same goes for "Proportional stride" partitioning. It's another,
> different, control for memory bandwidth. As of today, I don't think
> that we have a reference platform for experimenting with either of
> these.)
Thank you.
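
The bitmap example quoted above can be checked mechanically. The sketch
below is illustrative only: the array name mbw_pbm, the helper names and the
fixed width of four chunks are assumptions, and it models nothing beyond the
static arithmetic of the four example bitmaps.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CHUNKS 4  /* width of the example bitmaps */

/* Example portion bitmaps from the discussion: 1100, 0110, 0011, 1001. */
const unsigned int mbw_pbm[] = { 0xC, 0x6, 0x3, 0x9 };

/* Count the bandwidth chunks a bitmap grants. */
unsigned int chunk_count(unsigned int bitmap)
{
    unsigned int n = 0;

    /* Clear the lowest set bit on each iteration. */
    for (; bitmap; bitmap &= bitmap - 1)
        n++;
    return n;
}

/* Hard ceiling, in percent of total bandwidth, implied by a bitmap. */
unsigned int max_share_percent(unsigned int bitmap)
{
    assert((bitmap >> NUM_CHUNKS) == 0);
    return chunk_count(bitmap) * 100 / NUM_CHUNKS;
}

/* Two groups contend only if their bitmaps share at least one chunk. */
bool groups_contend(unsigned int a, unsigned int b)
{
    return (a & b) != 0;
}
```

Every bitmap grants two of the four chunks, so max_share_percent() is 50
for each group regardless of load. groups_contend() is false for the pairs
(1100, 0011) and (0110, 1001) and true for every other pair, which is the
fixed isolation/contention pattern described in the quoted text.
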
>
>
>>> When schemata is read, parents should always be printed before their
>>> child schemata. But really, we just need to make sure that the
>>> rdt_schema_all list is correctly ordered.
>>>
>>>
>>> Do you think that this relationship needs to be reported to userspace?
>>
>> You brought up the topic of relationships in
>> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
>> to learn more from the MPAM spec where I learned and went on a tangent about all
>> the other possible namespaces without circling back.
>>
>> I was hoping that the namespace prefix would make the relationships clear,
>> something like <resource>_<control>, but I did not expect another layer in
>> the hierarchy like your example above. The idea of "parent" and "child" is
>> also not obvious to me at this point. resctrl gives us a "resource" to start
>> with and we are now discussing multiple controls per resource. Could you please
>> elaborate on what you see as "parent" and "child"?
>
> See above -- the parent/child concept is not an MPAM thing; apologies
> if I didn't make that clear.
>
>> We do have the info directory available to express relationships and a
>> hierarchy is already starting to take shape there.
>
> I'm wondering whether using a common prefix will be future-proof? It
> may not always be clear which part of a name counts as the common
> prefix.
Apologies for my cryptic response. I was actually musing that we already
discussed using the info directory to express relationships between
controls and resources and it does not seem a big leap to expand
this to express relationships between controls. Consider something
like below for MPAM:

info
└── MB
    └── resource_schemata
        └── MB
            └── MB_HW
                ├── MB_MAX
                └── MB_MIN

On RDT it may then look different:

info
└── MB
    └── resource_schemata
        └── MB
            ├── MB_HW
            ├── MB_MAX
            └── MB_MIN

Having the resource name as common prefix does seem consistent and makes
clear to user space which controls apply to a resource.
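
As a rough illustration of how such a hierarchy could drive the
"parents before children" ordering discussed earlier in the thread, the
sketch below records each control's parent and emits the names depth first.
struct ctrl_desc, the two tables and order_ctrls() are hypothetical names
for this example only; they are not existing resctrl structures, and the
code assumes the relationships form a tree (no cycles).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one control; not a resctrl structure. */
struct ctrl_desc {
    const char *name;
    const char *parent;  /* NULL for the top-level (legacy) control */
};

/* MPAM-style layout: MB -> MB_HW -> { MB_MIN, MB_MAX } */
const struct ctrl_desc mpam_ctrls[] = {
    { "MB_MIN", "MB_HW" },
    { "MB",     NULL    },
    { "MB_MAX", "MB_HW" },
    { "MB_HW",  "MB"    },
};

/* RDT-style layout: MB -> { MB_HW, MB_MAX, MB_MIN }, mutually independent */
const struct ctrl_desc rdt_ctrls[] = {
    { "MB_HW",  "MB" },
    { "MB_MAX", "MB" },
    { "MB_MIN", "MB" },
    { "MB",     NULL },
};

/*
 * Append to out[] the names of all controls below 'parent' (or the
 * top-level controls if 'parent' is NULL), depth first, so that every
 * control appears after its parent. Returns the updated entry count.
 */
size_t order_ctrls(const struct ctrl_desc *c, size_t n, const char *parent,
                   const char **out, size_t count)
{
    for (size_t i = 0; i < n; i++) {
        int is_child = parent
            ? (c[i].parent && strcmp(c[i].parent, parent) == 0)
            : (c[i].parent == NULL);

        if (!is_child)
            continue;
        assert(count < n);  /* out[] must have room for n entries */
        out[count++] = c[i].name;
        count = order_ctrls(c, n, c[i].name, out, count);
    }
    return count;
}
```

Calling order_ctrls(mpam_ctrls, 4, NULL, out, 0) yields MB, MB_HW, MB_MIN,
MB_MAX even though the table itself is unordered; the RDT table yields MB
followed by its three independent children.
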
>
> There were already discussions about appending a number to a schema
> name in order to control different memory regions -- that's another
> prefix/suffix relationship, if so...
>
> We could handle all of this by documenting all the relationships
> explicitly. But I'm thinking that it could be easier for maintenance
> if the resctrl core code has explicit knowledge of the relationships.

Not just for resctrl itself, but also to make clear to user space which
controls impact others and which are independent.

> That said, using a common prefix is still a good idea. But maybe we
> shouldn't lean on it too heavily as a way of actually describing the
> relationships?

I do not think we can rely on the order of entries in the schemata file though.
For example, I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth"
for RDT to also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?).
In either case the schemata may print something like below on both platforms
(copied from your original example), where for MPAM it implies a relationship
but for RDT it does not:

MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32

>>> Since the "#" convention is for backward compatibility, maybe we should
>>> not use this for new schemata, and place the burden of managing
>>> conflicts onto userspace going forward. What do you think?
>>
>> I agree. The way I understand this is that the '#' will only be used for
>> new controls that shadow the default/current controls of the legacy resources.
>> I do not expect that the prefix will be needed for new resources, even if
>> the initial support of a new resource does not include all possible controls.
>
> OK. Note, relating this to the above, the # could be interpreted as
> meaning "this is a child of some other schema; don't mess with it
> unless you know what you are doing".
Could it be made more specific to be "this is a child of a legacy schema created
before this new format existed; don't mess with it unless you know what you are
doing"?
That is, any schema created after this new format is established does not need
the '#' prefix even if there is a parent/child relationship?
>
> Older software doesn't understand the relationships, so this is just
> there to stop it from shooting itself in the foot.

ack.

By extension I assume that software that understands a schema that is introduced
after the "relationship" format is established can be expected to understand the
format and thus these new schemata do not require the '#' prefix. Even if
a new schema is introduced with a single control it can be followed by a new child
control without a '#' prefix a couple of kernel releases later. By this point it
should hopefully be understood by user space that it should not write entries it does
not understand.
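
To illustrate what "do not write entries you do not understand" could look
like from user space, here is a minimal sketch of a parser for the line
format used in the example above. The '#' handling reflects the convention
as proposed in this thread, not an existing kernel interface, and
struct schema_line, parse_schema_line(), known_schemata and should_write()
are hypothetical names for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Parsed view of one schemata line, e.g. "# MB_HW: 0=32, 1=32". */
struct schema_line {
    bool shadowed;       /* line carried the proposed '#' prefix */
    char name[32];       /* schema name, e.g. "MB_HW" */
    const char *values;  /* points into the input, after the ':' */
};

/* Returns false if the line has no "<name>:" part or the name is too long. */
bool parse_schema_line(const char *line, struct schema_line *out)
{
    const char *colon;
    size_t len;

    assert(line && out);
    while (isspace((unsigned char)*line))
        line++;
    out->shadowed = (*line == '#');
    if (out->shadowed) {
        line++;
        while (isspace((unsigned char)*line))
            line++;
    }
    colon = strchr(line, ':');
    if (!colon)
        return false;
    len = (size_t)(colon - line);
    if (len == 0 || len >= sizeof(out->name))
        return false;
    memcpy(out->name, line, len);
    out->name[len] = '\0';
    out->values = colon + 1;
    while (isspace((unsigned char)*out->values))
        out->values++;
    return true;
}

/* Schema names this (hypothetical) tool knows how to configure. */
const char *const known_schemata[] = { "MB", "MB_HW" };

/* A tool should only write back entries whose schema it understands. */
bool should_write(const struct schema_line *sl)
{
    size_t n = sizeof(known_schemata) / sizeof(known_schemata[0]);

    for (size_t i = 0; i < n; i++)
        if (strcmp(sl->name, known_schemata[i]) == 0)
            return true;
    return false;
}
```

A tool built this way would skip "# MB_MIN: 0=31, 1=31" because MB_MIN is
not in its known list, independent of whether the line carries the '#'
prefix.
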
>
> [...]
>
>>>>>> MPAM has the "HARDLIM" distinction associated with these MAX values
>>>>>> and from what I can tell this is per PARTID. Is this something that needs
>>>>>> to be supported? To do this resctrl will need to support modifying
>>>>>> control properties per resource group.
>>>>>
>>>>> Possibly. Since this is a boolean control that determines how the
>>>>> MBW_MAX control is applied, we could perhaps present it as an
>>>>> additional schema -- if so, it's basically orthogonal.
>>>>>
>>>>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>>>>>
>>>>> or
>>>>>
>>>>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>>>>>
>>>>> Does this look reasonable?
>>>>
>>>> It does.
>>>
>>> OK -- note, I don't think we have any immediate plan to support this in
>>> the MPAM driver, but it may land eventually in some form.
>>>
>>
>> ack.
>
> (Or, of course, anything else that achieves the same goal...)

Right ... I did not dig into syntax that could be made to match existing
schema formats, etc.; that can be filled in later.

...
>>> I'll try to pull the state of this discussion together -- maybe as a
>>> draft update to the documentation, describing the interface as proposed
>>> so far. Does that work for you?
>>
>> It does. Thank you very much for taking this on.
>>
>> Reinette
>
> OK, I'll aim to follow up on this next week.

Thank you very much.

Reinette