public inbox for linux-kernel@vger.kernel.org
From: Drew Fustini <fustini@kernel.org>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: Dave Martin <Dave.Martin@arm.com>,
	linux-kernel@vger.kernel.org, Babu Moger <babu.moger@amd.com>,
	Fenghua Yu <fenghuay@nvidia.com>, Tony Luck <tony.luck@intel.com>,
	James Morse <james.morse@arm.com>,
	"Chen, Yu C" <yu.c.chen@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Jonathan Corbet <corbet@lwn.net>,
	x86@kernel.org
Subject: Re: [RFC] fs/resctrl: Generic schema description
Date: Sat, 24 Jan 2026 10:09:49 -0800
Message-ID: <aXUK7XFsHl+gnwA/@x1>
In-Reply-To: <fb1e2686-237b-4536-acd6-15159abafcba@intel.com>

On Tue, Dec 16, 2025 at 02:26:23PM -0800, Reinette Chatre wrote:
> Hi Babu and Fenghua,
> 
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
> 
> On 10/24/25 4:12 AM, Dave Martin wrote:
> > Hi all,
> > 
> > Going forward, a single resctrl resource (such as memory bandwidth) is
> > likely to require multiple schemata, either because we want to add new
> > schemata that provide finer control, or because the hardware has
> > multiple controls, covering different aspects of resource allocation.
> > 
> > The fit between MPAM's memory bandwidth controls and the resctrl MB
> > schema is already awkward, and later Intel RDT features such as Region
> > Aware Memory Bandwidth Allocation are already pushing past what the MB
> > schema can describe.  Both of these can involve multiple control
> > values and finer resolution than the 100 steps offered by the current
> > "MB" schema.
> > 
> > The previous discussion went off in a few different directions [1], so
> > I want to focus back onto defining an extended schema description that
> > aims to cover the use cases that we know about or anticipate today, and
> > allows for future extension as needed.
> > 
> > (A separate discussion is needed on how new schemata interact with
> > previously-defined schemata (such as the MB percentage schema).  I
> > suggest we pause that discussion for now, in the interests of getting
> > the schema description nailed down.)
> > 
> > 
> > Following on from the previous mail thread, I've tried to refine and
> > flesh out the proposal for schema descriptions a bit, as follows.
> > 
> > Proposal:
> > 
> >   * Split resource names and schema names in resctrlfs.
> > 
> >     Resources will be named for the unique, existing schema for each
> >     resource.
> > 
> >     The existing schema will keep its name (the same as the resource
> >     name), and new schemata defined for a resource will include that
> >     name as a prefix (at least, by default).
> > 
> >     So, for example, we will have an MB resource with a schema called
> >     MB (the schema that we have already).  But we may go on to define
> >     additional schemata for the MB resource, with names such as MB_MAX,
> >     etc.
> > 
> >   * Stop adding new schema description information in the top-level
> >     info/<resource>/ directory in resctrlfs.
> > 
> >     For backwards compatibility, we can keep the existing property
> >     files under the resource info directory to describe the previously
> >     defined resource, but we seem to need something richer going
> >     forward.
> > 
> >   * Add a hierarchy to list all the schemata for each resource, along
> >     with their properties.  So far, the proposal looks like this,
> >     taking the MB resource as an example:
> > 
> > 	info/
> > 	 └─ MB/
> > 	     └─ resource_schemata/
> > 	         ├─ MB/
> > 	         ├─ MB_MIN/
> > 	         ├─ MB_MAX/
> > 	         ┆
> > 
> >     Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> >     In this proposal, these are just dummy schema names for
> >     illustration purposes.  The important thing is that they all
> >     control aspects of the "MB" resource, and that there can be more
> >     than one of them.
> > 
> >     It may be appropriate to have a nested hierarchy, where some
> >     schemata are presented as children of other schemata if they
> >     affect the same hardware controls.  For now, let's put this issue
> >     on one side, and consider what properties should be advertised for
> >     each schema.
> > 
> >   * Current properties that I think we might want are:
> > 
> > 	info/
> > 	 └─ SOME_RESOURCE/
> > 	     └─ resource_schemata/
> > 	         ├─ SOME_SCHEMA/
> > 	         ┆   ├─ type
> > 	             ├─ min
> > 	             ├─ max
> > 	             ├─ tolerance
> > 	             ├─ resolution
> > 	             ├─ scale
> > 	             └─ unit
> > 
> >     (I've tweaked the properties a bit since previous postings.
> >     "type" replaces "map"; "scale" is now the unit multiplier;
> >     "resolution" is now a scaling divisor -- details below.)
> > 
> >     I assume that we expose the properties in individual files, but we
> >     could also combine them into a single description file per schema,
> >     per resource or (possibly) a single global file.
> >     (I don't have a strong view on the best option.)
> > 
> > 
> >     Either way, the following set of properties may be a reasonable
> >     place to start:
> > 
> > 
> >     type: the schema type, followed by optional flag specifiers:
> > 
> >       - "scalar": a single-valued numeric control
> > 
> >         A mandatory flag indicates how the control value written to
> >         the schemata file is converted to an amount of resource for
> >         hardware regulation.
> > 
> > 	The flag "linear" indicates a linear mapping.
> > 
> > 	In this case, the amount of resource E that is actually
> > 	allocated is derived from the control value C written to the
> > 	schemata file as follows:
> > 
> >     	E = C * scale * unit / resolution
> > 
> > 	Other flag values could be defined later, if we encounter
> > 	hardware with non-linear controls.
> > 
> >       - "bitmap": a bitmap control
> > 
> >         The optional flag "sparse" is present if the control accepts
> >         sparse bitmaps.
> > 
> > 	In this case, E = bitmap_weight(C) * scale * unit / resolution.
> > 
> > 	As before, each bit controls access to a specific chunk of
> > 	resource in the hardware, such as a group of cache lines.  All
> > 	chunks are equally sized.
> > 
> > 	(Different CTRL_MON groups may still contend within the
> > 	allocation E, when they have bits in common between their
> > 	bitmaps.)
> > 
> >     min:
> > 
> >       - For a scalar schema, the minimum value that can be written to
> >         the control when writing the schemata file.
> > 
> >       - For a bitmap schema, a bitmap of the minimum weight that the
> >         schema accepts: if an empty bitmap is accepted, this can be 0.
> >         Otherwise, if bitmaps with a single bit set are acceptable,
> >         this can just have the lowest-order bit set.
> > 
> > 	Most commonly, the value will probably be "1".
> > 
> > 	For bitmap schemata, we might report this in hex.  In the
> > 	interest of generic parsing, we could include a "0x" prefix if
> > 	so.
> > 
> >     max:
> > 
> >       - For a scalar schema, the maximum value that can be written to
> >         the control when writing the schemata file.
> > 
> >       - For a bitmap schema, the mask with all bits set.
> > 
> >         Possibly reported in hex for bitmap schemata (as for "min").
> > 
> >     tolerance:
> > 
> >         (See below for discussion on this.)
> > 
> >       - "0": the control is exact
> >       
> >       - "1": the effective control value is within ±1 of the control
> >         value written to the schemata file.  (Similarly, positive "n" ->
> >         ±n.)
> > 
> >         A negative value could be used to indicate that the tolerance
> >         is unknown.  (Possibly we could also just omit the property,
> >         though it seems better to warn userspace explicitly if we
> >         don't know.)
> > 
> > 	Tests might make use of this parameter in order to determine
> > 	how picky to be about exact measurement results.
> > 
> >     resolution:
> > 
> >       - For a proportional scalar schema: the number of divisions that
> >         the whole resource is divided into.  (See below for
> >         "proportional scalar schema".)
> > 
> > 	Typically, this will be the same as the "max" value.
> > 
> >       - For an absolute scalar schema: the divisor applied to the
> >         control value.
> > 
> >       - For a bitmap schema: the size of the bitmap in bits.
> > 
> >     scale:
> > 
> >       - For a scalar schema: the scale-up multiplier applied to
> >         "unit".
> > 
> >       - For a bitmap schema: probably "1".
> > 
> >     unit:
> > 
> >       - The base unit of the quantity measured by the control value.
> > 
> >         The special unit "all" denotes a proportional schema.  In this
> >         case, the resource is a finite, physical thing such as a cache
> >         or maxed-out data throughput of a memory controller.  The
> >         entire physical resource is available for allocation, and the
> >         control value indicates what proportion of it is allocated.
> > 
> > 	Bitmap schemata will probably all be proportional and use the
> > 	unit "all".  (This applies to cache bitmaps, at least.)
> > 
> > 	Absolute schemata will require specification of the base unit
> > 	here, say, "MBps".  The "scale" parameter can be used to avoid
> > 	proliferation of unit strings:
> > 
> > 	For example, {scale=1000, unit="MBps"} would be equivalent to
> > 	{scale=1, unit="GBps"}.
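
As a rough illustration of the two mappings quoted above, here is a
minimal Python sketch. Only the formulas come from the proposal; the
function names and the small unit table are assumptions made for this
example, and results are kept as exact fractions of "unit".

```python
from fractions import Fraction

# Assumed decimal unit table, only used to compare the two example encodings.
UNIT_IN_MBPS = {"MBps": 1, "GBps": 1000}


def scalar_amount(c, scale, resolution):
    """Linear scalar schema: E = C * scale * unit / resolution (in units of 'unit')."""
    return Fraction(c * scale, resolution)


def bitmap_amount(c, scale, resolution):
    """Bitmap schema: E = bitmap_weight(C) * scale * unit / resolution."""
    weight = bin(c).count("1")  # number of set bits in the bitmap
    return Fraction(weight * scale, resolution)


def absolute_mbps(c, scale, resolution, unit):
    """Absolute scalar schema, converted to MBps via the assumed unit table."""
    return scalar_amount(c, scale, resolution) * UNIT_IN_MBPS[unit]
```

With unit "all", scalar_amount(50, 1, 100) is one half of the resource,
and a 16-bit bitmap with four bits set yields one quarter. The last
helper shows why {scale=1000, unit="MBps"} and {scale=1, unit="GBps"}
describe the same control.
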
> > 
> > 
> > Note on the "tolerance" parameter:
> > 
> > This is a new addition.  On the MPAM side, the hardware has a choice
> > about how to interpret the control value in some edge-case situations.
> > We may not reasonably be able to probe for this, so it may be useful
> > to warn software that there is an uncertainty margin.
> > 
> > We might also be able to use the "tolerance" parameter to accommodate
> > the rounding behaviour of the existing "MB" schema (otherwise, we
> > might want a special "type" for this schema, if it doesn't comply
> > closely enough).
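
For the test use case mentioned under "tolerance", a check could look
roughly like the sketch below. The function name is hypothetical, and
treating a negative tolerance as "cannot be checked" is one possible
reading of the proposal, not something it mandates.

```python
def within_tolerance(written, effective, tolerance):
    """Compare the value written to the schemata file with the effective value.

    Returns True or False when the tolerance is known ("0" means exact,
    positive n means +/- n), and None when the tolerance is negative,
    i.e. advertised as unknown.
    """
    if tolerance < 0:
        return None
    return abs(effective - written) <= tolerance
```

A test could then skip the strict comparison when the result is None
instead of failing.
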
> > 
> > 
> > If we want to deploy resctrl under virtualisation, resctrl on the host
> > could dynamically affect the actual amount of resource that is
> > available for allocation inside a VM.
> > 
> > Whether or not we ever want to do that, it might be useful to have a
> > way to warn software that the effective control values hitting the
> > hardware may not be entirely predictable.
> > 
> > Thoughts?
> > 
> > Cheers
> > ---Dave
> 
> 
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with resource at a different scope.
> 
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
> 
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means.
> 
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an
> interface like below:
> 
> info
> └── SMBA
>     └── resource_schemata
>         ├── SMBA
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope <== contains "L3"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>         └── SMBA_NODE
>             ├── max
>             ├── min
>             ├── resolution
>             ├── scale
>             ├── scope <== contains "NODE"
>             ├── tolerance
>             ├── type
>             └── unit
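
To make the layout above concrete, here is a sketch of how userspace
might read one schema's properties, including the suggested "scope"
file. This hierarchy is only a proposal, so the paths, the helper name
and the property list are all assumptions taken from the tree above.

```python
from pathlib import Path

# Property file names as they appear in the proposed tree (not an existing ABI).
PROPS = ("type", "min", "max", "tolerance", "resolution", "scale", "unit", "scope")


def read_schema(info_root, resource, schema):
    """Return {property: text} for info/<resource>/resource_schemata/<schema>/.

    Property files that are missing are skipped, so older or partial
    layouts yield a smaller dictionary rather than an error.
    """
    base = Path(info_root) / resource / "resource_schemata" / schema
    props = {}
    for name in PROPS:
        path = base / name
        if path.is_file():
            props[name] = path.read_text().strip()
    return props
```

For the tree above, read_schema(info_root, "SMBA", "SMBA_NODE")["scope"]
would be "NODE", where info_root is wherever the resctrl info directory
is mounted.
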
> 
> With an interface like above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl are managed.
> For example, it can become explicit that the monitor groups' mon_data directory contains
> sub-directories organized by scope. For example:
> 
> mon_data
> ├── mon_L3_00       <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01       <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00     <== monitoring data at scope NODE
> │   └── mbm_total_bytes
> └── mon_NODE_01     <== monitoring data at scope NODE
>     └── mbm_total_bytes
> 
> What do you think?

I think that the ability to have different scopes for a resource would
work well for QoS on RISC-V. The CBQRI spec [1] defines bandwidth
controller operations which can be anywhere in the system. I've been
having trouble trying to decide what to do about a CBQRI-enabled memory
controller as all bandwidth monitoring is currently assumed to be L3.

Therefore, my RFC series [2] that adds resctrl support for RISC-V does
not support bandwidth monitoring, but I think the scope concept could make
it work.
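
To illustrate how scope-aware monitoring could look from userspace, here
is a small Python sketch that groups mon_data sub-directory names by
scope. The mon_<scope>_<id> pattern is taken from the example quoted
above (mon_NODE_* does not exist today); the regular expression and the
function name are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Assumed naming pattern: mon_<scope>_<decimal domain id>, e.g. mon_L3_00.
MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<id>\d+)$")


def group_by_scope(names):
    """Map each scope to the sorted list of domain IDs found in 'names'.

    Names that do not match the pattern are ignored.
    """
    groups = defaultdict(list)
    for name in names:
        match = MON_DIR.match(name)
        if match:
            groups[match.group("scope")].append(int(match.group("id")))
    return {scope: sorted(ids) for scope, ids in groups.items()}
```

A memory controller monitored at a non-L3 scope would then simply show
up as another key next to "L3".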

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0
[2] https://lore.kernel.org/all/20260119-ssqosid-cbqri-v1-0-aa2a75153832@kernel.org/
