From: Drew Fustini <fustini@kernel.org>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: Dave Martin <Dave.Martin@arm.com>,
linux-kernel@vger.kernel.org, Babu Moger <babu.moger@amd.com>,
Fenghua Yu <fenghuay@nvidia.com>, Tony Luck <tony.luck@intel.com>,
James Morse <james.morse@arm.com>,
"Chen, Yu C" <yu.c.chen@intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Jonathan Corbet <corbet@lwn.net>,
x86@kernel.org
Subject: Re: [RFC] fs/resctrl: Generic schema description
Date: Sat, 24 Jan 2026 10:09:49 -0800
Message-ID: <aXUK7XFsHl+gnwA/@x1>
In-Reply-To: <fb1e2686-237b-4536-acd6-15159abafcba@intel.com>

On Tue, Dec 16, 2025 at 02:26:23PM -0800, Reinette Chatre wrote:
> Hi Babu and Fenghua,
>
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
>
> On 10/24/25 4:12 AM, Dave Martin wrote:
> > Hi all,
> >
> > Going forward, a single resctrl resource (such as memory bandwidth) is
> > likely to require multiple schemata, either because we want to add new
> > schemata that provide finer control, or because the hardware has
> > multiple controls, covering different aspects of resource allocation.
> >
> > The fit between MPAM's memory bandwidth controls and the resctrl MB
> > schema is already awkward, and later Intel RDT features such as Region
> > Aware Memory Bandwidth Allocation are already pushing past what the MB
> > schema can describe. Both of these can involve multiple control
> > values and finer resolution than the 100 steps offered by the current
> > "MB" schema.
> >
> > The previous discussion went off in a few different directions [1], so
> > I want to focus back onto defining an extended schema description that
> > aims to cover the use cases that we know about or anticipate today, and
> > allows for future extension as needed.
> >
> > (A separate discussion is needed on how new schemata interact with
> > previously-defined schemata (such as the MB percentage schema). I
> > suggest we pause that discussion for now, in the interests of getting
> > the schema description nailed down.)
> >
> >
> > Following on from the previous mail thread, I've tried to refine and
> > flesh out the proposal for schema descriptions a bit, as follows.
> >
> > Proposal:
> >
> > * Split resource names and schema names in resctrlfs.
> >
> > Resources will be named for the unique, existing schema for each
> > resource.
> >
> > The existing schema will keep its name (the same as the resource
> > name), and new schemata defined for a resource will include that
> > name as a prefix (at least, by default).
> >
> > So, for example, we will have an MB resource with a schema called
> > MB (the schema that we have already). But we may go on to define
> > additional schemata for the MB resource, with names such as MB_MAX,
> > etc.
> >
> > * Stop adding new schema description information in the top-level
> > info/<resource>/ directory in resctrlfs.
> >
> > For backwards compatibility, we can keep the existing property
> > files under the resource info directory to describe the previously
> > defined resource, but we seem to need something richer going
> > forward.
> >
> > * Add a hierarchy to list all the schemata for each resource, along
> > with their properties. So far, the proposal looks like this,
> > taking the MB resource as an example:
> >
> >    info/
> >     └─ MB/
> >         └─ resource_schemata/
> >             ├─ MB/
> >             ├─ MB_MIN/
> >             ├─ MB_MAX/
> >             ┆
> >
> > Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> > In this proposal, these are just dummy schema names for
> > illustration purposes. The important thing is that they all
> > control aspects of the "MB" resource, and that there can be more
> > than one of them.
> >
> > It may be appropriate to have a nested hierarchy, where some
> > schemata are presented as children of other schemata if they
> > affect the same hardware controls. For now, let's put this issue
> > on one side, and consider what properties should be advertised for
> > each schema.
> >
> > * Current properties that I think we might want are:
> >
> >    info/
> >     └─ SOME_RESOURCE/
> >         └─ resource_schemata/
> >             ├─ SOME_SCHEMA/
> >             ┆   ├─ type
> >                 ├─ min
> >                 ├─ max
> >                 ├─ tolerance
> >                 ├─ resolution
> >                 ├─ scale
> >                 └─ unit
> >
> > (I've tweaked the properties a bit since previous postings.
> > "type" replaces "map"; "scale" is now the unit multiplier;
> > "resolution" is now a scaling divisor -- details below.)
> >
> > I assume that we expose the properties in individual files, but we
> > could also combine them into a single description file per schema,
> > per resource or (possibly) a single global file.
> > (I don't have a strong view on the best option.)
> >
> >
> > Either way, the following set of properties may be a reasonable
> > place to start:
> >
> >
> > type: the schema type, followed by optional flag specifiers:
> >
> > - "scalar": a single-valued numeric control
> >
> > A mandatory flag indicates how the control value written to
> > the schemata file is converted to an amount of resource for
> > hardware regulation.
> >
> > The flag "linear" indicates a linear mapping.
> >
> > In this case, the amount of resource E that is actually
> > allocated is derived from the control value C written to the
> > schemata file as follows:
> >
> > E = C * scale * unit / resolution
> >
> > Other flag values could be defined later, if we encounter
> > hardware with non-linear controls.
> >
> > - "bitmap": a bitmap control
> >
> > The optional flag "sparse" is present if the control accepts
> > sparse bitmaps.
> >
> > In this case, E = bitmap_weight(C) * scale * unit / resolution.
> >
> > As before, each bit controls access to a specific chunk of
> > resource in the hardware, such as a group of cache lines. All
> > chunks are equally sized.
> >
> > (Different CTRL_MON groups may still contend within the
> > allocation E, when they have bits in common between their
> > bitmaps.)
> >
> > min:
> >
> > - For a scalar schema, the minimum value that can be written to
> > the control when writing the schemata file.
> >
> > - For a bitmap schema, a bitmap of the minimum weight that the
> > schema accepts: if an empty bitmap is accepted, this can be 0.
> > Otherwise, if bitmaps with a single bit set are acceptable,
> > this can just have the lowest-order bit set.
> >
> > Most commonly, the value will probably be "1".
> >
> > For bitmap schemata, we might report this in hex. In the
> > interest of generic parsing, we could include a "0x" prefix if
> > so.
> >
> > max:
> >
> > - For a scalar schema, the maximum value that can be written to
> > the control when writing the schemata file.
> >
> > - For a bitmap schema, the mask with all bits set.
> >
> > Possibly reported in hex for bitmap schemata (as for "min").
> >
> > tolerance:
> >
> > (See below for discussion on this.)
> >
> > - "0": the control is exact
> >
> > - "1": the effective control value is within ±1 of the control
> > value written to the schemata file. (Similarly, positive "n" ->
> > ±n.)
> >
> > A negative value could be used to indicate that the tolerance
> > is unknown. (Possibly we could also just omit the property,
> > though it seems better to warn userspace explicitly if we
> > don't know.)
> >
> > Tests might make use of this parameter in order to determine
> > how picky to be about exact measurement results.
> >
> > resolution:
> >
> > - For a proportional scalar schema: the number of divisions that
> > the whole resource is divided into. (See below for
> > "proportional scalar schema".)
> >
> > Typically, this will be the same as the "max" value.
> >
> > - For an absolute scalar schema: the divisor applied to the
> > control value.
> >
> > - For a bitmap schema: the size of the bitmap in bits.
> >
> > scale:
> >
> > - For a scalar schema: the scale-up multiplier applied to
> > "unit".
> >
> > - For a bitmap schema: probably "1".
> >
> > unit:
> >
> > - The base unit of the quantity measured by the control value.
> >
> > The special unit "all" denotes a proportional schema. In this
> > case, the resource is a finite, physical thing such as a cache
> > or maxed-out data throughput of a memory controller. The
> > entire physical resource is available for allocation, and the
> > control value indicates what proportion of it is allocated.
> >
> > Bitmap schemata will probably all be proportional and use the
> > unit "all". (This applies to cache bitmaps, at least.)
> >
> > Absolute schemata will require specification of the base unit
> > here, say, "MBps". The "scale" parameter can be used to avoid
> > proliferation of unit strings:
> >
> > For example, {scale=1000, unit="MBps"} would be equivalent to
> > {scale=1, unit="GBps"}.
> >
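
As a rough illustration of the two formulas above (not part of the
proposal itself), the sketch below computes E in multiples of "unit" for
a linear scalar schema and for a bitmap schema. The function name and its
arguments are hypothetical, and the property values in the usage note are
made up for the example.

```python
from fractions import Fraction


def effective_amount(schema_type, control, scale, resolution):
    """Return E / unit for a control value C, following the formulas
    quoted above:

        scalar (linear): E = C * scale * unit / resolution
        bitmap:          E = bitmap_weight(C) * scale * unit / resolution

    The result is a Fraction so that proportional schemata (unit "all")
    come out as an exact share of the whole resource.
    """
    if schema_type == "scalar":
        count = control
    elif schema_type == "bitmap":
        # bitmap_weight(C): number of set bits in the control value
        count = bin(control).count("1")
    else:
        raise ValueError(f"unknown schema type: {schema_type!r}")
    return Fraction(count * scale, resolution)
```

With these assumed values, a proportional scalar control of 50 with
resolution=100 gives 1/2 of "all"; a bitmap of 0xf0 with a 12-bit
resolution gives 1/3 of "all"; and an absolute control of 3 with
{scale=1000, resolution=1, unit="MBps"} gives 3000 MBps.
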
> >
> > Note on the "tolerance" parameter:
> >
> > This is a new addition. On the MPAM side, the hardware has a choice
> > about how to interpret the control value in some edge-case situations.
> > We may not reasonably be able to probe for this, so it may be useful
> > to warn software that there is an uncertainty margin.
> >
> > We might also be able to use the "tolerance" parameter to accommodate
> > the rounding behaviour of the existing "MB" schema (otherwise, we
> > might want a special "type" for this schema, if it doesn't comply
> > closely enough).
> >
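
A minimal sketch of how a test might consume the "tolerance" property as
described above, assuming the effective control value can be observed
somehow. The helper name is hypothetical; a negative tolerance is treated
as "unknown", as suggested in the property description.

```python
def within_tolerance(written, effective, tolerance):
    """Check an observed effective control value against the advertised
    tolerance.

    Returns True or False when the tolerance is known (0 means exact,
    n means within +/- n of the written value), and None when the
    tolerance is negative, i.e. unknown, so the caller can decide how
    strict to be.
    """
    if tolerance < 0:
        return None
    return abs(effective - written) <= tolerance
```

A test could then skip or relax its exact-match assertions whenever the
helper returns None.
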
> >
> > If we want to deploy resctrl under virtualisation, resctrl on the host
> > could dynamically affect the actual amount of resource that is
> > available for allocation inside a VM.
> >
> > Whether or not we ever want to do that, it might be useful to have a
> > way to warn software that the effective control values hitting the
> > hardware may not be entirely predictable.
> >
> > Thoughts?
> >
> > Cheers
> > ---Dave
>
>
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with resources at a different scope.
>
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
>
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means.
>
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an interface
> like the one below:
>
> info
> └── SMBA
>     └── resource_schemata
>         ├── SMBA
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope      <== contains "L3"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>         └── SMBA_NODE
>             ├── max
>             ├── min
>             ├── resolution
>             ├── scale
>             ├── scope      <== contains "NODE"
>             ├── tolerance
>             ├── type
>             └── unit
>
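
For illustration only, userspace could read such a per-schema directory
with something like the sketch below. It assumes the one-file-per-property
layout shown above, which is still only a proposal; the helper name is
hypothetical and no such files exist in resctrl today.

```python
from pathlib import Path


def read_schema_props(schema_dir):
    """Read every regular file in a schema directory into a dict.

    Keys are the property file names ("scope", "min", "max", ...);
    values are the file contents with surrounding whitespace stripped.
    Values are left as strings because the final format of each
    property has not been settled.
    """
    props = {}
    for entry in sorted(Path(schema_dir).iterdir()):
        if entry.is_file():
            props[entry.name] = entry.read_text().strip()
    return props
```

Under the proposed layout, calling this on
info/SMBA/resource_schemata/SMBA_NODE below the resctrl mount point would
return a dict whose "scope" entry is "NODE".
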
> With an interface like the above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl are managed.
> For example, it can become explicit that the monitor groups' mon_data directory contains
> sub-directories organized by scope. For example:
>
> mon_data
> ├── mon_L3_00          <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01          <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00        <== monitoring data at scope NODE
> │   └── mbm_total_bytes
> └── mon_NODE_01        <== monitoring data at scope NODE
>     └── mbm_total_bytes
>
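
If the scope is encoded in the directory name as in the example above, a
tool could recover it with a small parser. The sketch below assumes names
of the form mon_<SCOPE>_<domain id>, where the scope contains no
underscore; both the pattern and the function name are assumptions for
illustration, and names that do not fit this form are rejected.

```python
import re

# Assumed naming pattern: "mon_" + scope + "_" + decimal domain ID
_MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_dir(name):
    """Split a mon_data sub-directory name into (scope, domain_id)."""
    match = _MON_DIR.match(name)
    if match is None:
        raise ValueError(f"not a mon_<SCOPE>_<id> directory name: {name!r}")
    return match.group("scope"), int(match.group("domain"))
```

For the tree above, "mon_L3_00" parses to ("L3", 0) and "mon_NODE_01" to
("NODE", 1).
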
> What do you think?

I think that the ability to have different scopes for a resource would
work well for QoS on RISC-V. The CBQRI spec [1] defines operations for
bandwidth controllers, which can be anywhere in the system. I've been
having trouble deciding what to do about a CBQRI-enabled memory
controller, as all bandwidth monitoring is currently assumed to be at L3
scope. Therefore, my RFC series [2] that adds resctrl support for RISC-V
does not support bandwidth monitoring, but I think the scope concept
could make it work.

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0
[2] https://lore.kernel.org/all/20260119-ssqosid-cbqri-v1-0-aa2a75153832@kernel.org/

Thread overview: 11+ messages
2025-10-24 11:12 [RFC] fs/resctrl: Generic schema description Dave Martin
2025-10-28 23:17 ` Reinette Chatre
2025-10-30 16:36 ` Dave Martin
2025-11-04 22:26 ` Reinette Chatre
2025-11-06 17:45 ` Reinette Chatre
2025-11-10 12:37 ` Ben Horgan
2025-12-16 22:26 ` Reinette Chatre
2025-12-26 10:38 ` Chen, Yu C
2026-01-07 15:53 ` Dave Martin
2026-01-09 16:09 ` Chen, Yu C
2026-01-24 18:09 ` Drew Fustini [this message]