Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch

From: Reinette Chatre <reinette.chatre@intel.com>
To: Dave Martin <Dave.Martin@arm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>,
	<linux-kernel@vger.kernel.org>,
	"James Morse" <james.morse@arm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Jonathan Corbet" <corbet@lwn.net>, <x86@kernel.org>,
	<linux-doc@vger.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
Date: Fri, 17 Oct 2025 08:59:45 -0700
Message-ID: <e788ca62-ec63-4552-978b-9569f369afd5@intel.com>
In-Reply-To: <aPJP52jXJvRYAjjV@e133380.arm.com>

Hi Dave,

On 10/17/25 7:17 AM, Dave Martin wrote:
> Hi Reinette,
> 
> On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/15/25 8:47 AM, Dave Martin wrote:
>>> Hi Reinette,
>>>
>>> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 10/13/25 7:36 AM, Dave Martin wrote:

...

>>>>> So long as the entries affecting a single resource are ordered so that
>>>>> each entry is strictly more specific than the previous entries (as
>>>>> illustrated above), then reading schemata and stripping all the hashes
>>>>> would allow a previous configuration to be restored; to change just one
>>>>> entry, userspace can uncomment just that one, or write only that entry
>>>>> (which is what I think we should recommend for new software).
>>>>
>>>> This is a good rule of thumb.
>>>
>>> To avoid printing entries in the wrong order, do we want to track some
>>> parent/child relationship between schemata?
>>>
>>> In the above example,
>>>
>>> 	* MB is the parent of MB_HW;
>>>
>>> 	* MB_HW is the parent of MB_MIN and MB_MAX.
>>>
>>> (for MPAM, at least).
>>
>> Could you please elaborate on this relationship? I envisioned the MB_HW to be
>> something similar to Intel RDT's "optimal" bandwidth setting ... something
>> that is expected to be somewhere between the "min" and the "max".
>>
>> But, now I think I'm a bit lost in MPAM since it is not clear to me what
>> MB_HW represents ... would this be the "memory bandwidth portion
>> partitioning"? Although, that uses a completely different format from
>> "min" and "max".
> 
> I confess that I'm thinking with an MPAM mindset here.
> 
> Some pseudocode might help to illustrate how these might interact:
> 
> 	set_MB(partid, val) {
> 		set_MB_HW(partid, percent_to_hw_val(val));
> 	}
> 
> 	set_MB_HW(partid, val) {
> 		set_MB_MAX(partid, val);
> 
> 		/*
> 		 * Hysteresis to avoid steady flows from ping-ponging
> 		 * between low and high priority:
> 		 */
> 		if (hardware_has_MB_MIN())
> 			set_MB_MIN(partid, val * 95%);
> 	}
> 
> 	set_MB_MIN(partid, val) {
> 		mpam->MBW_MIN[partid] = val;
> 	}
> 
> 	set_MB_MAX(partid, val) {
> 		mpam->MBW_MAX[partid] = val;
> 	}
> 
> with
> 
> 	get_MB(partid) {
> 		return hw_val_to_percent(get_MB_HW(partid));
> 	}
> 
> 	get_MB_HW(partid) { return get_MB_MAX(partid); }
> 
> 	get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; }
> 
> 	get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
> 
> 
> The parent/child relationship I suggested is basically the call-graph
> of this pseudocode.  These could all be exposed as resctrl schemata,
> but the children provide finer / more broken-down control than the
> parents.  Reading a parent provides a merged or approximated view of
> the configuration of the child schemata.
> 
> In particular,
> 
> 	set_child(partid, get_child(partid));
> 	get_parent(partid);
> 
> yields the same result as
> 
> 	get_parent(partid);
> 
> but will not be true in general, if the roles of parent and child are
> reversed.
> 
> I think this still holds true if implementing an "MB_HW" schema for
> newer revisions of RDT.  The pseudocode would be different, but there
> will still be a tree-like call graph (?)

Thank you very much for the example. I missed in earlier examples that
MB_HW was being controlled via MB_MAX and MB_MIN.
I do not expect such a dependence or tree-like call graph for RDT where
the closest equivalent (termed "optimal") is programmed independently from
min and max.
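
For reference, the quoted MPAM pseudocode can be turned into a small
runnable sketch. This is only an illustration of the call graph, not driver
code: the Mpam class, the HW_MAX full-scale value and the linear
percent-to-hardware mapping are assumptions made for the example, and the
95% hysteresis figure is taken from the pseudocode.

```python
HW_MAX = 0xFFFF  # assumed full-scale value of the bandwidth fields


class Mpam:
    """Toy model of the MB -> MB_HW -> {MB_MIN, MB_MAX} call graph."""

    def __init__(self, has_mb_min=True):
        self.has_mb_min = has_mb_min
        self.mbw_min = {}
        self.mbw_max = {}

    # Leaf controls: plain per-PARTID register accessors.
    def set_mb_min(self, partid, val):
        self.mbw_min[partid] = val

    def set_mb_max(self, partid, val):
        self.mbw_max[partid] = val

    def get_mb_min(self, partid):
        return self.mbw_min[partid]

    def get_mb_max(self, partid):
        return self.mbw_max[partid]

    # MB_HW: parent of MB_MIN and MB_MAX.
    def set_mb_hw(self, partid, val):
        self.set_mb_max(partid, val)
        if self.has_mb_min:
            # Hysteresis from the pseudocode: MIN is 95% of MAX.
            self.set_mb_min(partid, val * 95 // 100)

    def get_mb_hw(self, partid):
        return self.get_mb_max(partid)

    # MB: parent of MB_HW, expressed as a percentage.
    def set_mb(self, partid, percent):
        self.set_mb_hw(partid, percent * HW_MAX // 100)

    def get_mb(self, partid):
        return round(self.get_mb_hw(partid) * 100 / HW_MAX)


m = Mpam()
m.set_mb(0, 50)
m.set_mb_min(0, 100)              # tune a child directly
m.set_mb_max(0, m.get_mb_max(0))  # child write-back: parent view is unchanged
parent_view = m.get_mb(0)
m.set_mb_hw(0, m.get_mb_hw(0))    # parent write-back: the MB_MIN tuning is lost
```

Writing a child back to itself leaves the parent's reading intact, while
writing the parent back to itself replaces the hand-tuned MB_MIN with the
derived value, which is the asymmetry described in the quoted text.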

> 
> 
> Going back to MPAM:
> 
> Re MPAM memory bandwidth portion partitioning (a.k.a., MBW_PART or
> MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory
> bandwidth is split into discrete, non-overlapping chunks, and each
> PARTID is configured with a bitmap saying which chunks it can use.
> This could be done by time-slicing, or controlling which memory
> controllers/ports a PARTID can issue requests to, or something like
> that.
> 
> If the MBW_MAX control isn't implemented, then the current MPAM driver
> maps this bitmap control onto the resctrl "MB" schema in a simple way,
> but we are considering dropping this, since the allocation model
> (explicit, static allocation of discrete resources) is not really the
> same as for RDT MBA (dynamic prioritisation based on recent resource
> consumption).
> 
> Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend
> on an equal footing for memory bandwidth until one exceeds 50% (when it
> will start to be penalised).  Programming bitmaps can't have the same
> effect.  For example, with { 1100, 0110, 0011, 1001 }, no group can use
> more than 50% of the full bandwidth, whatever happens.  Worse, certain
> pairs of groups are fully isolated from each other, while others are
> always in contention, no matter how little actual traffic is generated.
> This is potentially useful, but it's not the same as the MIN/MAX model.
> 
> So, it may make more sense to expose this as a separate, bitmap schema.
> 
> (The same goes for "Proportional stride" partitioning.  It's another,
> different, control for memory bandwidth.  As of today, I don't think
> that we have a reference platform for experimenting with either of
> these.)

Thank you.
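
The bitmap example quoted above can be checked with a short sketch. It
assumes four equally sized bandwidth portions and treats each bitmap as the
set of portions a group may use; the helper names are made up for the
illustration.

```python
from itertools import combinations

# The four portion bitmaps from the quoted example, one per group.
bitmaps = [0b1100, 0b0110, 0b0011, 0b1001]
NUM_PORTIONS = 4  # assumption: four equally sized portions


def max_share(bitmap):
    """Upper bound on the fraction of total bandwidth a group can use."""
    return bin(bitmap).count("1") / NUM_PORTIONS


def shares_portion(a, b):
    """True if two groups have at least one portion in common."""
    return (a & b) != 0


shares = [max_share(b) for b in bitmaps]
pairs = list(combinations(range(len(bitmaps)), 2))
isolated = [(i, j) for i, j in pairs
            if not shares_portion(bitmaps[i], bitmaps[j])]
contending = [(i, j) for i, j in pairs
              if shares_portion(bitmaps[i], bitmaps[j])]
```

Every group is capped at half of the bandwidth, groups (0, 2) and (1, 3)
never share a portion, and the remaining four pairs always overlap in
exactly one portion regardless of how much traffic they generate.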

> 
> 
>>> When schemata is read, parents should always be printed before their
>>> child schemata.  But really, we just need to make sure that the
>>> rdt_schema_all list is correctly ordered.
>>>
>>>
>>> Do you think that this relationship needs to be reported to userspace?
>>
>> You brought up the topic of relationships in
>> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
>> to learn more from the MPAM spec where I learned and went on a tangent about all
>> the other possible namespaces without circling back.
>>
>> I was hoping that the namespace prefix would make the relationships clear,
>> something like <resource>_<control>, but I did not expect another layer in
>> the hierarchy like your example above. The idea of "parent" and "child" is
>> also not obvious to me at this point. resctrl gives us a "resource" to start
>> with and we are now discussing multiple controls per resource. Could you please
>> elaborate what you see as "parent" and "child"?
> 
> See above -- the parent/child concept is not an MPAM thing; apologies
> if I didn't make that clear.
> 
>> We do have the info directory available to express relationships and a
>> hierarchy is already starting to take shape there.
> 
> I'm wondering whether using a common prefix will be future-proof?  It
> may not always be clear which part of a name counts as the common
> prefix.

Apologies for my cryptic response. I was actually musing that we already
discussed using the info directory to express relationships between
controls and resources and it does not seem a big leap to expand
this to express relationships between controls. Consider something
like below for MPAM:

info
└── MB
    └── resource_schemata
        └── MB
            └── MB_HW
                ├── MB_MAX
                └── MB_MIN


On RDT it may then look different:

info
└── MB
    └── resource_schemata
        └── MB
            ├── MB_HW
            ├── MB_MAX
            └── MB_MIN

Having the resource name as common prefix does seem consistent and makes
clear to user space which controls apply to a resource. 
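
As a rough sketch of how the two layouts above might be consumed, the
nested dictionaries below mirror the directory trees (they are a
hypothetical in-memory form, not an existing resctrl structure), and a
pre-order walk gives an order in which every parent precedes its children.

```python
# Hypothetical in-memory form of the two hierarchies drawn above:
# each control maps to a dict of its child controls.
MPAM_TREE = {"MB": {"MB_HW": {"MB_MAX": {}, "MB_MIN": {}}}}
RDT_TREE = {"MB": {"MB_HW": {}, "MB_MAX": {}, "MB_MIN": {}}}


def print_order(tree):
    """Pre-order walk: every parent is listed before its children."""
    order = []
    for name, children in tree.items():
        order.append(name)
        order.extend(print_order(children))
    return order


def parent_of(tree, target, parent=None):
    """Return the parent control of `target`, or None if it has none."""
    for name, children in tree.items():
        if name == target:
            return parent
        found = parent_of(children, target, name)
        if found is not None:
            return found
    return None


mpam_order = print_order(MPAM_TREE)
rdt_order = print_order(RDT_TREE)
```

Both trees flatten to the same sequence (MB, MB_HW, MB_MAX, MB_MIN), while
parent_of() reports MB_HW as the parent of MB_MAX only for the MPAM layout,
so the hierarchy carries information that a flat listing does not.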

> 
> There were already discussions about appending a number to a schema
> name in order to control different memory regions -- that's another
> prefix/suffix relationship, if so...
> 
> We could handle all of this by documenting all the relationships
> explicitly.  But I'm thinking that it could be easier for maintenance
> if the resctrl core code has explicit knowledge of the relationships.

Not just for resctrl itself, but also to make clear to user space which
controls impact others and which are independent.

> That said, using a common prefix is still a good idea.  But maybe we
> shouldn't lean on it too heavily as a way of actually describing the
> relationships?

I do not think we can rely on the order in the schemata file though. For example,
I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to
also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?). In either
case, the schemata may print something like below on both platforms (copied from
your original example), where for MPAM it implies a relationship but for RDT it
does not:

MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
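
To make the "strip the hashes" idea concrete, here is a minimal sketch of a
user-space parser for the listing above. The text format is the one
proposed in this thread, not what resctrl prints today, and the function
names are invented for the example.

```python
SAMPLE = """\
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
"""


def parse_schemata(text):
    """Return (name, {domain: value}, shadowed) for each entry, in file order."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        shadowed = line.startswith("#")  # '#' marks a shadowing entry
        name, _, values = line.lstrip("# ").partition(":")
        domains = {}
        for item in values.split(","):
            dom, _, val = item.strip().partition("=")
            domains[int(dom)] = int(val)
        entries.append((name.strip(), domains, shadowed))
    return entries


def strip_hashes(text):
    """Uncomment every entry, keeping the order, to restore a saved config."""
    return [line.lstrip("# ") for line in text.splitlines() if line.strip()]


entries = parse_schemata(SAMPLE)
restore = strip_hashes(SAMPLE)
```

Writing the lines in restore back in order would apply the most specific
entries last, so they win over the merged MB value.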

 
>>> Since the "#" convention is for backward compatibility, maybe we should
>>> not use this for new schemata, and place the burden of managing
>>> conflicts onto userspace going forward.  What do you think?
>>
>> I agree. The way I understand this is that the '#' will only be used for
>> new controls that shadow the default/current controls of the legacy resources.
>> I do not expect that the prefix will be needed for new resources, even if
>> the initial support of a new resource does not include all possible controls.
> 
> OK.  Note, relating this to the above, the # could be interpreted as
> meaning "this is a child of some other schema; don't mess with it
> unless you know what you are doing".

Could it be made more specific to be "this is a child of a legacy schema created
before this new format existed; don't mess with it unless you know what you are
doing"?
That is, any schema created after this new format is established does not need
the '#' prefix even if there is a parent/child relationship?

> 
> Older software doesn't understand the relationships, so this is just
> there to stop it from shooting itself in the foot.

ack.

By extension I assume that software that understands a schema that is introduced
after the "relationship" format is established can be expected to understand the
format and thus these new schemata do not require the '#' prefix. Even if
a new schema is introduced with a single control it can be followed by a new child
control without a '#' prefix a couple of kernel releases later. By this point it
should hopefully be understood by user space that it should not write entries it does
not understand.

> 
> [...]
> 
>>>>>> MPAM has the "HARDLIM" distinction associated with these MAX values
>>>>>> and from what I can tell this is per PARTID. Is this something that needs
>>>>>> to be supported? To do this resctrl will need to support modifying
>>>>>> control properties per resource group.
>>>>>
>>>>> Possibly.  Since this is a boolean control that determines how the
>>>>> MBW_MAX control is applied, we could perhaps present it as an
>>>>> additional schema -- if so, it's basically orthogonal.
>>>>>
>>>>>  | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>>>>>
>>>>> or
>>>>>
>>>>>  | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>>>>>
>>>>> Does this look reasonable?
>>>>
>>>> It does.
>>>
>>> OK -- note, I don't think we have any immediate plan to support this in
>>> the MPAM driver, but it may land eventually in some form.
>>>
>>
>> ack.
> 
> (Or, of course, anything else that achieves the same goal...)

Right ... I did not dig into syntax that could be made to match existing
schema formats, etc.; that can be filled in later.

...

>>> I'll try to pull the state of this discussion together -- maybe as a
>>> draft update to the documentation, describing the interface as proposed
>>> so far.  Does that work for you?
>>
>> It does. Thank you very much for taking this on.
>>
>> Reinette
> 
> OK, I'll aim to follow up on this next week.

Thank you very much.

Reinette

