From: "Luck, Tony" <tony.luck@intel.com>
To: Dave Martin <Dave.Martin@arm.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>,
	<linux-kernel@vger.kernel.org>, James Morse <james.morse@arm.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "Borislav Petkov" <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Jonathan Corbet <corbet@lwn.net>, <x86@kernel.org>,
	<linux-doc@vger.kernel.org>
Subject: Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
Date: Mon, 20 Oct 2025 09:31:18 -0700
Message-ID: <aPZj1nDVEYmYytY9@agluck-desk3>
In-Reply-To: <aPZaTk97RC6sg+uQ@e133380.arm.com>

On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote:
> Hi Reinette,
> 
> On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote:
> > Hi Dave,
> > 
> > On 10/17/25 7:17 AM, Dave Martin wrote:
> > > Hi Reinette,
> > > 
> > > On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
> > >> Hi Dave,
> > >>
> > >> On 10/15/25 8:47 AM, Dave Martin wrote:
> 
> [...]
> 
> > >>> To avoid printing entries in the wrong order, do we want to track some
> > >>> parent/child relationship between schemata?
> > >>>
> > >>> In the above example,
> > >>>
> > >>> 	* MB is the parent of MB_HW;
> > >>>
> > >>> 	* MB_HW is the parent of MB_MIN and MB_MAX.
> > >>>
> > >>> (for MPAM, at least).
> > >>
> > >> Could you please elaborate this relationship? I envisioned the MB_HW to be
> > >> something similar to Intel RDT's "optimal" bandwidth setting ... something
> > >> that is expected to be somewhere between the "min" and the "max".
> > >>
> > >> But, now I think I'm a bit lost in MPAM since it is not clear to me what
> > >> MB_HW represents ... would this be the "memory bandwidth portion
> > >> partitioning"? Although, that uses a completely different format from
> > >> "min" and "max".
> > > 
> > > I confess that I'm thinking with an MPAM mindset here.
> > > 
> > > Some pseudocode might help to illustrate how these might interact:
> > > 
> > > 	set_MB(partid, val) {
> > > 		set_MB_HW(partid, percent_to_hw_val(val));
> 
> [...]
> 
> > > 	get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
> > > 
> > > 
> > > The parent/child relationship I suggested is basically the call-graph
> > > of this pseudocode.  These could all be exposed as resctrl schemata,
> > > but the children provide finer / more broken-down control than the
> > > parents.  Reading a parent provides a merged or approximated view of
> > > the configuration of the child schemata.
> > > 
> > > In particular,
> > > 
> > > 	set_child(partid, get_child(partid));
> > > 	get_parent(partid);
> > > 
> > > yields the same result as
> > > 
> > > 	get_parent(partid);
> > > 
> > > but will not be true in general, if the roles of parent and child are
> > > reversed.
> > > 
> > > I think this still holds true if implementing an "MB_HW" schema for
> > > newer revisions of RDT.  The pseudocode would be different, but there
> > > will still be a tree-like call graph (?)
> > 
> > Thank you very much for the example. I missed in earlier examples that
> > MB_HW was being controlled via MB_MAX and MB_MIN.
> > I do not expect such a dependence or tree-like call graph for RDT where
> > the closest equivalent (termed "optimal") is programmed independently from
> > min and max.
> 
> I hadn't realised that this RDT feature has three control thresholds.
> 
> I'll comment in more detail on your sample info/ hierarchy, below.
> 
> > > 
> > > Going back to MPAM:
> 
> [...]
> 
> > > So, it may make more sense to expose [MBWPBM] as a separate, bitmap schema.
> > > 
> > > (The same goes for "Proportional stride" partitioning.  It's another,
> > > different, control for memory bandwidth.  As of today, I don't think
> > > that we have a reference platform for experimenting with either of
> > > these.)
> > 
> > Thank you.
> > 
> > > 
> > > 
> > >>> When schemata is read, parents should always be printed before their
> > >>> child schemata.  But really, we just need to make sure that the
> > >>> rdt_schema_all list is correctly ordered.
> > >>>
> > >>>
> > >>> Do you think that this relationship needs to be reported to userspace?
> 
> [...]
> 
> > >> We do have the info directory available to express relationships and a
> > >> hierarchy is already starting to take shape there.
> > > 
> > > I'm wondering whether using a common prefix will be future-proof?  It
> > > may not always be clear which part of a name counts as the common
> > > prefix.
> > 
> > Apologies for my cryptic response. I was actually musing that we already
> > discussed using the info directory to express relationships between
> > controls and resources and it does not seem a big leap to expand
> > this to express relationships between controls. Consider something
> > like below for MPAM:
> > 
> > info
> > └── MB
> >     └── resource_schemata
> >         └── MB
> >             └── MB_HW
> >                 ├── MB_MAX
> >                 └── MB_MIN
> > 
> > 
> > On RDT it may then look different:
> > 
> > info
> > └── MB
> >     └── resource_schemata
> >         └── MB
> >             ├── MB_HW
> >             ├── MB_MAX
> >             └── MB_MIN
> > 
> > Having the resource name as common prefix does seem consistent and makes
> > clear to user space which controls apply to a resource. 
> 
> Ack.
> 
> The above hierarchies make sense, but I wonder whether we should be
> forcing software to understand the MIN and MAX limits?
> 
> I can still see a benefit in having MB_HW be a generic, software-
> defined control, even on RDT.  Then, this can always be available,
> with similar behaviour, on all resctrl instances that support memory
> bandwidth controls.  The precise set of child controls will vary per
> arch (and on MPAM at least, between different hardware
> implementations) -- so these look like they will work less well as a
> generic interface.
> 
> 
> Considering RDT: to avoid random regulation behaviour, RDT says that
> you need MIN <= OPT <= MAX, so a generic "MB_HW" control that does not
> require software to understand the individual MIN, OPT and MAX
> thresholds would still need to program all of these under the hood so
> as to avoid an invalid combination being set in the hardware.
> 
> If I have understood the definition of the MARC table correctly, then
> there is a separate flag to report the presence of each of MIN, MAX and
> OPT, so software _might_ be expected to use a random subset of them(?)
> (If so, that's somewhat like the MPAM situation.)
> 
> So, I wonder whether we could actually have the following on RDT?
> 
>   info
>   ├── MB
>   ┆   └── resource_schemata
>           ├── MB
>           ┆   └── MB_HW
>                   ├── MB_MAX
>                   ├── MB_MIN
>                   └── MB_OPT
> 
> If MB_HW is programmed by software, then MB_MAX, MB_OPT and MB_MIN
> would be programmed with some reasonable default spread (or possibly,
> all with the same value).
> 
> That way, software that wants independent control over MIN, OPT and MAX
> can have it (and sweat the problem of dealing with hardware where they
> aren't all implemented -- if that's a thing).  But software that
> doesn't need this fine control gets a single MB_HW knob that is more-or-
> less portable between platforms.
> 
> Does that make sense, or is it an abstraction too far?
> 
> 
> (Going one step further, maybe we can actually put MPAM and RDT
> together with a 3-threshold model.  For MPAM, we could possibly express
> the HARDLIM option using the extra threshold...  that probably needs a
> bit more thought, though.)
> 
> > > There were already discussions about appending a number to a schema
> > > name in order to control different memory regions -- that's another
> > > prefix/suffix relationship, if so...
> > > 
> > > We could handle all of this by documenting all the relationships
> > > explicitly.  But I'm thinking that it could be easier for maintenance
> > > if the resctrl core code has explicit knowledge of the relationships.
> > 
> > Not just for resctrl itself but to make clear to user space which
> > controls impact others and which are independent.
> > 
> > > That said, using a common prefix is still a good idea.  But maybe we
> > > shouldn't lean on it too heavily as a way of actually describing the
> > > relationships?
> > 
> > I do not think we can rely on the order in the schemata file though. For
> > example, I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth"
> > for RDT to also use the MB_HW name (or maybe MPAM and RDT can both use
> > MB_OPT?). In either case the schemata may print something like below on
> > both platforms (copied from your original example), where for MPAM it
> > implies a relationship but for RDT it does not:
> > 
> > MB: 0=50, 1=50
> > # MB_HW: 0=32, 1=32
> > # MB_MIN: 0=31, 1=31
> > # MB_MAX: 0=32, 1=32
> 
> This still DTRT though?  If MB_HW maps into the "optimal bandwidth"
> control on RDT, then it is still safe to program it first, before
> MB_{MIN,MAX}.
> 
> The contents of the schemata file won't be sufficient to figure out the
> relationships, but that wasn't my intention.  We have info/ for that.
> 
> Instead, the schemata file just needs to be ordered in a way that is
> compatible with those relationships, so that one line does not
> unintentionally clobber the effect of a subsequent line.
> 
> 
> My concern was that if we rely totally on manual maintenance to keep the
> schemata file in a compatible order, we'll probably get that wrong
> sooner or later...
> 
> > >>> Since the "#" convention is for backward compatibility, maybe we should
> > >>> not use this for new schemata, and place the burden of managing
> > >>> conflicts onto userspace going forward.  What do you think?
> > >>
> > >> I agree. The way I understand this is that the '#' will only be used for
> > >> new controls that shadow the default/current controls of the legacy resources.
> > >> I do not expect that the prefix will be needed for new resources, even if
> > >> the initial support of a new resource does not include all possible controls.
> > > 
> > > OK.  Note, relating this to the above, the # could be interpreted as
> > > meaning "this is a child of some other schema; don't mess with it
> > > unless you know what you are doing".
> > 
> > Could it be made more specific to be "this is a child of a legacy schema created
> > before this new format existed; don't mess with it unless you know what you are
> > doing"?
> > That is, any schema created after this new format is established does not need
> > the '#' prefix even if there is a parent/child relationship?
> 
> Yes, I think so.
> 
> Except: if some schema is advertised and documented with no children,
> then is it reasonable for software to assume that it will never have
> children?
> 
> I think that the answer is probably "yes", in which case would it make
> sense to # any schema that is a child of some schema that did not have
> children in some previous upstream kernel?
> 
> > > 
> > > Older software doesn't understand the relationships, so this is just
> > > there to stop it from shooting itself in the foot.
> > 
> > ack.
> > 
> > By extension I assume that software that understands a schema that is introduced
> > after the "relationship" format is established can be expected to understand the
> > format and thus these new schemata do not require the '#' prefix. Even if
> > a new schema is introduced with a single control it can be followed by a new child
> > control without a '#' prefix a couple of kernel releases later. By this point it
> > should hopefully be understood by user space that it should not write entries it does
> > not understand.
> 
> Generally, yes.
> 
> I think that boils down to: "OK, previously you could just tweak bits
> of the whole schemata file you read and write the whole thing back,
> and the effect would be what you intuitively expected.  But in future
> different schemata in the file may not be independent of one another.
> We'll warn you which things might not be independent, but we may not
> describe exactly how they affect each other."

Changes to the schemata file are currently "staged" and then applied.
There's some filesystem level error/sanity checking during the parsing
phase, but maybe for MB some parts can also be delayed, and re-ordered
when architecture code applies the changes.

E.g. filesystem code could check min <= opt <= max, while architecture
code would be responsible for writing the values to h/w in a sane manner
(assuming the architecture cares about transient effects when the values
briefly don't conform to that ordering).

E.g. the user requests moving from min,opt,max = 10,20,30 to 40,50,60.
Regardless of the order in which those requests appeared in the write(2)
syscall, architecture code bumps max to 60, then opt to 50, and finally
min to 40.
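
A minimal sketch of that two-phase idea, in plain userspace C rather
than real resctrl code. Every name in it (mb_limits, mb_write,
mb_limits_apply, ...) is made up for illustration, and the array store
stands in for whatever register write an architecture would really do.
It assumes both the current and the requested values already pass the
min <= opt <= max check:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; none of this is existing resctrl code. */
enum mb_field { MB_MIN, MB_OPT, MB_MAX, MB_NFIELDS };

struct mb_limits {
	unsigned int val[MB_NFIELDS];
};

/* Order of the simulated hardware writes, kept so it can be inspected. */
#define MB_LOG_MAX 16
static enum mb_field write_log[MB_LOG_MAX];
static int write_count;

/* Filesystem-level sanity check on a set of values: min <= opt <= max. */
static bool mb_limits_valid(const struct mb_limits *l)
{
	return l->val[MB_MIN] <= l->val[MB_OPT] &&
	       l->val[MB_OPT] <= l->val[MB_MAX];
}

/* Stand-in for one register write; the invariant must hold afterwards. */
static void mb_write(struct mb_limits *hw, enum mb_field f, unsigned int v)
{
	hw->val[f] = v;
	if (write_count < MB_LOG_MAX)
		write_log[write_count++] = f;
	assert(mb_limits_valid(hw));
}

/*
 * Apply a staged request field by field.  Raises go top-down (max, opt,
 * min) and lowers go bottom-up (min, opt, max), so min <= opt <= max
 * holds after every individual write.  The caller is expected to have
 * validated "req" with mb_limits_valid() first.
 */
static void mb_limits_apply(struct mb_limits *hw, const struct mb_limits *req)
{
	int f;

	for (f = MB_MAX; f >= MB_MIN; f--)
		if (req->val[f] > hw->val[f])
			mb_write(hw, (enum mb_field)f, req->val[f]);

	for (f = MB_MIN; f <= MB_MAX; f++)
		if (req->val[f] < hw->val[f])
			mb_write(hw, (enum mb_field)f, req->val[f]);
}
```

With hw = 10,20,30 and req = 40,50,60 this issues the writes in the
order max, opt, min. Going back down to 10,20,30 it writes min, opt,
max instead. Fields whose value does not change are not written at all.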

> 
> "So, from now on, only write the things that you actually want to set."
> 
> Does that sound about right?

Users might still use their favorite editor on the schemata file and
so write everything, while only changing a subset. So if we don't go
for the full two-phase update I describe above this would be:

  "only *change* the things that you actually want to set".
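
One way the kernel side could support that rule, purely as a sketch:
compare each staged value with the one currently in effect and skip the
entries that did not change, so that an untouched line written back by
an editor cannot clobber a related control. The struct and function
names below are hypothetical, and whether resctrl should really behave
this way is an assumption, not something decided in this thread:

```c
#include <stddef.h>

/* Hypothetical view of one control parsed from a schemata write. */
struct staged_ctrl {
	const char *name;	/* e.g. "MB", "MB_MIN" */
	unsigned int cur;	/* value currently in effect */
	unsigned int staged;	/* value parsed from the write */
	unsigned int writes;	/* simulated hardware writes so far */
};

/*
 * Apply only the entries whose staged value differs from the current
 * one.  Returns the number of controls that were actually written.
 */
static size_t apply_changed(struct staged_ctrl *c, size_t n)
{
	size_t applied = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (c[i].staged == c[i].cur)
			continue;	/* unchanged: leave hardware alone */
		c[i].cur = c[i].staged;
		c[i].writes++;
		applied++;
	}
	return applied;
}
```

E.g. if the file is written back with MB=50 and MB_MAX=32 untouched
and only MB_MIN changed from 31 to 20, only MB_MIN is written.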

> [...]
> 
> > >>>
> > >>> OK -- note, I don't think we have any immediate plan to support [HARDLIM] in
> > >>> the MPAM driver, but it may land eventually in some form.
> > >>>
> > >>
> > >> ack.
> > > 
> > > (Or, of course, anything else that achieves the same goal...)
> > 
> > Right ... I did not dig into syntax that could be made to match existing
> > schema formats etc. that can be filled in later.
> 
> Ack
> 
> > ...
> > 
> > >>> I'll try to pull the state of this discussion together -- maybe as a
> > >>> draft update to the documentation, describing the interface as proposed
> > >>> so far.  Does that work for you?
> > >>
> > >> It does. Thank you very much for taking this on.
> > >>
> > >> Reinette
> > > 
> > > OK, I'll aim to follow up on this next week.
> > 
> > Thank you very much.
> > 
> > Reinette
> 
> Cheers
> ---Dave

-Tony
