* [RFC] fs/resctrl: Generic schema description
From: Dave Martin @ 2025-10-24 11:12 UTC
To: linux-kernel
Cc: Tony Luck, Reinette Chatre, James Morse, Chen, Yu C,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86
Hi all,
Going forward, a single resctrl resource (such as memory bandwidth) is
likely to require multiple schemata, either because we want to add new
schemata that provide finer control, or because the hardware has
multiple controls, covering different aspects of resource allocation.
The fit between MPAM's memory bandwidth controls and the resctrl MB
schema is already awkward, and later Intel RDT features such as Region
Aware Memory Bandwidth Allocation are already pushing past what the MB
schema can describe. Both of these can involve multiple control
values and finer resolution than the 100 steps offered by the current
"MB" schema.
The previous discussion went off in a few different directions [1], so
I want to focus back onto defining an extended schema description that
aims to cover the use cases that we know about or anticipate today, and
allows for future extension as needed.
(A separate discussion is needed on how new schemata interact with
previously-defined schemata (such as the MB percentage schema).
I suggest we pause that discussion for now, in the interests of getting
the schema description nailed down.)
Following on from the previous mail thread, I've tried to refine and
flesh out the proposal for schema descriptions a bit, as follows.
Proposal:
* Split resource names and schema names in resctrlfs.
Resources will be named for the unique, existing schema for each
resource.
The existing schema will keep its name (the same as the resource
name), and new schemata defined for a resource will include that
name as a prefix (at least, by default).
So, for example, we will have an MB resource with a schema called
MB (the schema that we have already). But we may go on to define
additional schemata for the MB resource, with names such as MB_MAX,
etc.
* Stop adding new schema description information in the top-level
info/<resource>/ directory in resctrlfs.
For backwards compatibility, we can keep the existing property
files under the resource info directory to describe the previously
defined resource, but we seem to need something richer going
forward.
* Add a hierarchy to list all the schemata for each resource, along
with their properties. So far, the proposal looks like this,
taking the MB resource as an example:
    info/
     └─ MB/
         └─ resource_schemata/
             ├─ MB/
             ├─ MB_MIN/
             ├─ MB_MAX/
             ┆
Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
In this proposal, these are just dummy schema names for
illustration purposes. The important thing is that they all
control aspects of the "MB" resource, and that there can be more
than one of them.
It may be appropriate to have a nested hierarchy, where some
schemata are presented as children of other schemata if they
affect the same hardware controls. For now, let's put this issue
on one side, and consider what properties should be advertised for
each schema.
* Current properties that I think we might want are:
    info/
     └─ SOME_RESOURCE/
         └─ resource_schemata/
             ├─ SOME_SCHEMA/
             ┆   ├─ type
                 ├─ min
                 ├─ max
                 ├─ tolerance
                 ├─ resolution
                 ├─ scale
                 └─ unit
(I've tweaked the properties a bit since previous postings.
"type" replaces "map"; "scale" is now the unit multiplier;
"resolution" is now a scaling divisor -- details below.)
I assume that we expose the properties in individual files, but we
could also combine them into a single description file per schema,
per resource or (possibly) a single global file.
(I don't have a strong view on the best option.)
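To illustrate the one-file-per-property option, here is a minimal userspace sketch in Python. It assumes the directory layout proposed above, which does not exist in any current kernel, and the helper names read_schema_properties() and parse_number() are hypothetical. Parsing with base 0 accepts both plain decimal and "0x"-prefixed hex values.

```python
from pathlib import Path


def read_schema_properties(schema_dir):
    """Read every regular file in schema_dir into a {name: text} dict.

    schema_dir is assumed to look like
    info/<resource>/resource_schemata/<schema>/ from the proposal.
    """
    props = {}
    for entry in sorted(Path(schema_dir).iterdir()):
        if entry.is_file():
            props[entry.name] = entry.read_text().strip()
    return props


def parse_number(text):
    """Parse a decimal or "0x"-prefixed hex value (base 0 auto-detects)."""
    return int(text.strip(), 0)
```

A generic tool could call read_schema_properties() on each schema directory and only interpret the properties it recognises, which is one argument for keeping the values individually parseable.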
Either way, the following set of properties may be a reasonable
place to start:
type: the schema type, followed by optional flag specifiers:
- "scalar": a single-valued numeric control
A mandatory flag indicates how the control value written to
the schemata file is converted to an amount of resource for
hardware regulation.
The flag "linear" indicates a linear mapping.
In this case, the amount of resource E that is actually
allocated is derived from the control value C written to the
schemata file as follows:
E = C * scale * unit / resolution
Other flag values could be defined later, if we encounter
hardware with non-linear controls.
- "bitmap": a bitmap control
The optional flag "sparse" is present if the control accepts
sparse bitmaps.
In this case, E = bitmap_weight(C) * scale * unit / resolution.
As before, each bit controls access to a specific chunk of
resource in the hardware, such as a group of cache lines. All
chunks are equally sized.
(Different CTRL_MON groups may still contend within the
allocation E, when they have bits in common between their
bitmaps.)
min:
- For a scalar schema, the minimum value that can be written to
the control when writing the schemata file.
- For a bitmap schema, a bitmap of the minimum weight that the
schema accepts: if an empty bitmap is accepted, this can be 0.
Otherwise, if bitmaps with a single bit set are acceptable,
this can just have the lowest-order bit set.
Most commonly, the value will probably be "1".
For bitmap schemata, we might report this in hex. In the
interest of generic parsing, we could include a "0x" prefix if
so.
max:
- For a scalar schema, the maximum value that can be written to
the control when writing the schemata file.
- For a bitmap schema, the mask with all bits set.
Possibly reported in hex for bitmap schemata (as for "min").
tolerance:
(See below for discussion on this.)
- "0": the control is exact
- "1": the effective control value is within ±1 of the control
value written to the schemata file. (Similarly, positive "n" ->
±n.)
A negative value could be used to indicate that the tolerance
is unknown. (Possibly we could also just omit the property,
though it seems better to warn userspace explicitly if we
don't know.)
Tests might make use of this parameter in order to determine
how picky to be about exact measurement results.
resolution:
- For a proportional scalar schema: the number of divisions that
the whole resource is divided into. (See below for
"proportional scalar schema".)
Typically, this will be the same as the "max" value.
- For an absolute scalar schema: the divisor applied to the
control value.
- For a bitmap schema: the size of the bitmap in bits.
scale:
- For a scalar schema: the scale-up multiplier applied to
"unit".
- For a bitmap schema: probably "1".
unit:
- The base unit of the quantity measured by the control value.
The special unit "all" denotes a proportional schema. In this
case, the resource is a finite, physical thing such as a cache
or maxed-out data throughput of a memory controller. The
entire physical resource is available for allocation, and the
control value indicates what proportion of it is allocated.
Bitmap schemata will probably all be proportional and use the
unit "all". (This applies to cache bitmaps, at least.)
Absolute schemata will require specification of the base unit
here, say, "MBps". The "scale" parameter can be used to avoid
proliferation of unit strings:
For example, {scale=1000, unit="MBps"} would be equivalent to
{scale=1, unit="GBps"}.
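As a worked illustration of the two formulas above, the following Python sketch computes E as an exact multiple of "unit". The names SchemaDesc and effective_amount() are hypothetical, the flag handling ("linear", "sparse") is left out, and the property values in the examples are made up for illustration only.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class SchemaDesc:
    kind: str        # the "type" property: "scalar" or "bitmap"
    resolution: int  # scaling divisor (or bitmap size in bits)
    scale: int       # multiplier applied to "unit"
    unit: str        # "all" for proportional schemata, else e.g. "MBps"


def effective_amount(control, desc):
    """Return (amount, unit), where amount = E / unit as an exact Fraction."""
    if desc.kind == "scalar":
        weight = control                      # E = C * scale * unit / resolution
    elif desc.kind == "bitmap":
        weight = bin(control).count("1")      # bitmap_weight(C)
    else:
        raise ValueError(f"unknown schema type: {desc.kind}")
    return Fraction(weight * desc.scale, desc.resolution), desc.unit


# Proportional scalar: 50 out of 100 divisions of the whole resource.
print(effective_amount(50, SchemaDesc("scalar", 100, 1, "all")))
# Bitmap: 4 bits set in a 12-bit mask, each bit an equally sized chunk.
print(effective_amount(0x00F, SchemaDesc("bitmap", 12, 1, "all")))
# Absolute scalar described as {scale=1000, unit="MBps"}, control value 2.
print(effective_amount(2, SchemaDesc("scalar", 1, 1000, "MBps")))
```

Using Fraction avoids committing to any rounding behaviour, which the proposal leaves to the "tolerance" property.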
Note on the "tolerance" parameter:
This is a new addition. On the MPAM side, the hardware has a choice
about how to interpret the control value in some edge-case situations.
We may not reasonably be able to probe for this, so it may be useful
to warn software that there is an uncertainty margin.
We might also be able to use the "tolerance" parameter to accommodate
the rounding behaviour of the existing "MB" schema (otherwise, we
might want a special "type" for this schema, if it doesn't comply
closely enough).
If we want to deploy resctrl under virtualisation, resctrl on the host
could dynamically affect the actual amount of resource that is
available for allocation inside a VM.
Whether or not we ever want to do that, it might be useful to have a
way to warn software that the effective control values hitting the
hardware may not be entirely predictable.
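For what it's worth, a test could consume "tolerance" roughly as in the Python sketch below. The function name within_tolerance() is hypothetical, and treating a negative tolerance as "unknown" follows the suggestion in the property description above rather than any existing behaviour.

```python
def within_tolerance(written, effective, tolerance):
    """Check an observed effective control value against "tolerance".

    Returns None when tolerance is negative (unknown), so that callers
    can decide whether to skip the check or fall back to a looser one.
    """
    if tolerance < 0:
        return None
    return abs(effective - written) <= tolerance
```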
Thoughts?
Cheers
---Dave
[1] Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
* Re: [RFC] fs/resctrl: Generic schema description
From: Reinette Chatre @ 2025-10-28 23:17 UTC
To: Dave Martin, linux-kernel
Cc: Tony Luck, James Morse, Chen, Yu C, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86
Hi Dave,
On 10/24/25 4:12 AM, Dave Martin wrote:
> Hi all,
>
> Going forward, a single resctrl resource (such as memory bandwidth) is
> likely to require multiple schemata, either because we want to add new
> schemata that provide finer control, or because the hardware has
> multiple controls, covering different aspects of resource allocation.
>
> The fit between MPAM's memory bandwidth controls and the resctrl MB
> schema is already awkward, and later Intel RDT features such as Region
> Aware Memory Bandwidth Allocation are already pushing past what the MB
> schema can describe. Both of these can involve multiple control
> values and finer resolution than the 100 steps offered by the current
> "MB" schema.
>
> The previous discussion went off in a few different directions [1], so
> I want to focus back onto defining an extended schema description that
> aims to cover the use cases that we know about or anticipate today, and
> allows for future extension as needed.
>
> (A separate discussion is needed on how new schemata interact with
> previously-defined schemata (such as the MB percentage schema).
> I suggest we pause that discussion for now, in the interests of getting
> the schema description nailed down.)
ok, but let's keep this as "open #1"
> Following on from the previous mail thread, I've tried to refine and
> flesh out the proposal for schema descriptions a bit, as follows.
>
> Proposal:
>
> * Split resource names and schema names in resctrlfs.
>
> Resources will be named for the unique, existing schema for each
> resource.
Are you referring to the implementation or how things are exposed to user
space? I am trying to understand how the existing L3CODE/L3DATA schemata
fit in ... they are presented to user space as two separate resources since
they each have their own directory in "info" while internally they are
schemata of the L3 resource.
Just trying to understand if you are talking about reverting
https://lore.kernel.org/all/20210728170637.25610-1-james.morse@arm.com/ ?
The current implementation appears to match this proposal so we may need to
have special cases to keep CDP backwards compatible.
SMBA may also need some extra care ... especially if other architectures start
to allocate memory bandwidth to CXL resource via their "MB" resource.
> The existing schema will keep its name (the same as the resource
> name), and new schemata defined for a resource will include that
> name as a prefix (at least, by default).
>
> So, for example, we will have an MB resource with a schema called
> MB (the schema that we have already). But we may go on to define
> additional schemata for the MB resource, with names such as MB_MAX,
> etc.
>
> * Stop adding new schema description information in the top-level
> info/<resource>/ directory in resctrlfs.
>
> For backwards compatibility, we can keep the existing property
> files under the resource info directory to describe the previously
> defined resource, but we seem to need something richer going
> forward.
>
> * Add a hierarchy to list all the schemata for each resource, along
> with their properties. So far, the proposal looks like this,
> taking the MB resource as an example:
>
>     info/
>      └─ MB/
>          └─ resource_schemata/
>              ├─ MB/
>              ├─ MB_MIN/
>              ├─ MB_MAX/
>              ┆
>
> Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> In this proposal, these are just dummy schema names for
> illustration purposes. The important thing is that they all
> control aspects of the "MB" resource, and that there can be more
> than one of them.
>
> It may be appropriate to have a nested hierarchy, where some
> schemata are presented as children of other schemata if they
> affect the same hardware controls. For now, let's put this issue
> on one side, and consider what properties should be advertised for
> each schema.
ok to put this aside but I think we should keep including it, "open #2" ?
>
> * Current properties that I think we might want are:
>
>     info/
>      └─ SOME_RESOURCE/
>          └─ resource_schemata/
>              ├─ SOME_SCHEMA/
>              ┆   ├─ type
>                  ├─ min
>                  ├─ max
>                  ├─ tolerance
>                  ├─ resolution
>                  ├─ scale
>                  └─ unit
>
> (I've tweaked the properties a bit since previous postings.
> "type" replaces "map"; "scale" is now the unit multiplier;
> "resolution" is now a scaling divisor -- details below.)
>
> I assume that we expose the properties in individual files, but we
> could also combine them into a single description file per schema,
> per resource or (possibly) a single global file.
> (I don't have a strong view on the best option.)
>
>
> Either way, the following set of properties may be a reasonable
> place to start:
>
>
> type: the schema type, followed by optional flag specifiers:
>
> - "scalar": a single-valued numeric control
>
> A mandatory flag indicates how the control value written to
> the schemata file is converted to an amount of resource for
> hardware regulation.
>
> The flag "linear" indicates a linear mapping.
>
> In this case, the amount of resource E that is actually
> allocated is derived from the control value C written to the
> schemata file as follows:
>
> E = C * scale * unit / resolution
>
> Other flag values could be defined later, if we encounter
> hardware with non-linear controls.
>
> - "bitmap": a bitmap control
>
> The optional flag "sparse" is present if the control accepts
> sparse bitmaps.
>
> In this case, E = bitmap_weight(C) * scale * unit / resolution.
>
> As before, each bit controls access to a specific chunk of
> resource in the hardware, such as a group of cache lines. All
> chunks are equally sized.
>
> (Different CTRL_MON groups may still contend within the
> allocation E, when they have bits in common between their
> bitmaps.)
Would it not be simpler to have the files/properties depend on the
schema type? It almost seems as though some of the properties are forced
to have some meaning for bitmap when they do not seem to be needed. Instead,
for a bitmap type there can be bitmap specific properties like, for example,
bit_usage. This may also create more flexibility when there is a future
mapping function needed that depends on some new property?
Reinette
* Re: [RFC] fs/resctrl: Generic schema description
From: Dave Martin @ 2025-10-30 16:36 UTC
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Chen, Yu C, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86
Hi Reinette,
On Tue, Oct 28, 2025 at 04:17:05PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 10/24/25 4:12 AM, Dave Martin wrote:
> > Hi all,
> >
> > Going forward, a single resctrl resource (such as memory bandwidth) is
> > likely to require multiple schemata, either because we want to add new
> > schemata that provide finer control, or because the hardware has
> > multiple controls, covering different aspects of resource allocation.
> >
> > The fit between MPAM's memory bandwidth controls and the resctrl MB
> > schema is already awkward, and later Intel RDT features such as Region
> > Aware Memory Bandwidth Allocation are already pushing past what the MB
> > schema can describe. Both of these can involve multiple control
> > values and finer resolution than the 100 steps offered by the current
> > "MB" schema.
> >
> > The previous discussion went off in a few different directions [1], so
> > I want to focus back onto defining an extended schema description that
> > aims to cover the use cases that we know about or anticipate today, and
> > allows for future extension as needed.
> >
> > (A separate discussion is needed on how new schemata interact with
> > previously-defined schemata (such as the MB percentage schema).
> > I suggest we pause that discussion for now, in the interests of getting
> > the schema description nailed down.)
>
> ok, but let's keep this as "open #1"
>
> > Following on from the previous mail thread, I've tried to refine and
> > flesh out the proposal for schema descriptions a bit, as follows.
> >
> > Proposal:
> >
> > * Split resource names and schema names in resctrlfs.
> >
> > Resources will be named for the unique, existing schema for each
> > resource.
>
> Are you referring to the implementation or how things are exposed to user
> space? I am trying to understand how the existing L3CODE/L3DATA schemata
> fit in ... they are presented to user space as two separate resources since
> they each have their own directory in "info" while internally they are
> schema of the L3 resource.
Good question -- I didn't take into account here the fact that some
physical resources already have multiple schemata exposed to userspace.
I've probably overformalised, here. I'm not proposing to refactor the
arrangement of existing schemata and resources.
So we would continue to have
info/L3CODE/resource_schemata/L3CODE/ and
info/L3DATA/resource_schemata/L3DATA/.
I think that the decision to combine these under a single resctrl
resource internally is the most logical one, but I'm proposing just to
extend the info/ content, without unnecessary changes.
The current arrangement does have one shortcoming, which is that
software doesn't know (other than by built-in knowledge) that L3CODE
and L3DATA claim resource from the same hardware pool, so
L3CODE:0=0001
L3DATA:0=0001
implies that the transactions on the I-side and D-side contend for
cache lines (unless there are separate L3 I- and D-caches -- but I
don't think that's a thing on any relevant system...)
So, we might want some way to indicate that L3CODE and L3DATA are
linked. But I think that CDP is a unique case where we can reasonably
expect some built-in userspace knowledge.
I didn't currently plan to address this, but it could come later if we
think it's important.
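To make the contention point concrete, here is a small Python sketch of the "built-in knowledge" a userspace tool would need today: it parses schemata lines and reports the bits shared between L3CODE and L3DATA in each domain. The helper names parse_schemata() and shared_bits() are hypothetical, and the sketch assumes bitmap schemata only, with masks written in hex without a prefix.

```python
def parse_schemata(text):
    """Parse schemata lines into {schema: {domain_id: mask}}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, domains = line.partition(":")
        masks = {}
        for item in domains.split(";"):
            dom, _, mask = item.partition("=")
            masks[int(dom)] = int(mask, 16)   # cache masks are hex, no prefix
        result[name.strip()] = masks
    return result


def shared_bits(schemata, a="L3CODE", b="L3DATA"):
    """Return {domain_id: common bits} for domains where a and b overlap."""
    overlap = {}
    for dom, mask in schemata.get(a, {}).items():
        common = mask & schemata.get(b, {}).get(dom, 0)
        if common:
            overlap[dom] = common
    return overlap


print(shared_bits(parse_schemata("L3CODE:0=0001\nL3DATA:0=0001\n")))
```

Nothing in the info/ hierarchy tells the tool that these two schemata draw from the same pool; the pairing is hard-coded in the default arguments.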
> Just trying to understand if you are talking about reverting
> https://lore.kernel.org/all/20210728170637.25610-1-james.morse@arm.com/ ?
No...
> The current implementation appears to match this proposal so we may need to
> have special cases to keep CDP backwards compatible.
>
> SMBA may also need some extra care ... especially if other architectures start
> to allocate memory bandwidth to CXL resource via their "MB" resource.
Perhaps. I think it may be necessary to hack up an implementation of
these changes, to flush out things that don't quite fit.
>
> > The existing schema will keep its name (the same as the resource
> > name), and new schemata defined for a resource will include that
> > name as a prefix (at least, by default).
> >
> > So, for example, we will have an MB resource with a schema called
> > MB (the schema that we have already). But we may go on to define
> > additional schemata for the MB resource, with names such as MB_MAX,
> > etc.
> >
> > * Stop adding new schema description information in the top-level
> > info/<resource>/ directory in resctrlfs.
> >
> > For backwards compatibility, we can keep the existing property
> > files under the resource info directory to describe the previously
> > defined resource, but we seem to need something richer going
> > forward.
> >
> > * Add a hierarchy to list all the schemata for each resource, along
> > with their properties. So far, the proposal looks like this,
> > taking the MB resource as an example:
> >
> >     info/
> >      └─ MB/
> >          └─ resource_schemata/
> >              ├─ MB/
> >              ├─ MB_MIN/
> >              ├─ MB_MAX/
> >              ┆
> >
> > Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> > In this proposal, these are just dummy schema names for
> > illustration purposes. The important thing is that they all
> > control aspects of the "MB" resource, and that there can be more
> > than one of them.
> >
> > It may be appropriate to have a nested hierarchy, where some
> > schemata are presented as children of other schemata if they
> > affect the same hardware controls. For now, let's put this issue
> > on one side, and consider what properties should be advertised for
> > each schema.
>
> ok to put this aside but I think we should keep including it, "open #2" ?
Yes; I'm not abandoning this, but I wanted to focus on the schema
description, here.
> > * Current properties that I think we might want are:
> >
> >     info/
> >      └─ SOME_RESOURCE/
> >          └─ resource_schemata/
> >              ├─ SOME_SCHEMA/
> >              ┆   ├─ type
> >                  ├─ min
> >                  ├─ max
> >                  ├─ tolerance
> >                  ├─ resolution
> >                  ├─ scale
> >                  └─ unit
> >
> > (I've tweaked the properties a bit since previous postings.
> > "type" replaces "map"; "scale" is now the unit multiplier;
> > "resolution" is now a scaling divisor -- details below.)
> >
> > I assume that we expose the properties in individual files, but we
> > could also combine them into a single description file per schema,
> > per resource or (possibly) a single global file.
> > (I don't have a strong view on the best option.)
> >
> >
> > Either way, the following set of properties may be a reasonable
> > place to start:
> >
> >
> > type: the schema type, followed by optional flag specifiers:
> >
> > - "scalar": a single-valued numeric control
> >
> > A mandatory flag indicates how the control value written to
> > the schemata file is converted to an amount of resource for
> > hardware regulation.
> >
> > The flag "linear" indicates a linear mapping.
> >
> > In this case, the amount of resource E that is actually
> > allocated is derived from the control value C written to the
> > schemata file as follows:
> >
> > E = C * scale * unit / resolution
> >
> > Other flag values could be defined later, if we encounter
> > hardware with non-linear controls.
> >
> > - "bitmap": a bitmap control
> >
> > The optional flag "sparse" is present if the control accepts
> > sparse bitmaps.
> >
> > In this case, E = bitmap_weight(C) * scale * unit / resolution.
> >
> > As before, each bit controls access to a specific chunk of
> > resource in the hardware, such as a group of cache lines. All
> > chunks are equally sized.
> >
> > (Different CTRL_MON groups may still contend within the
> > allocation E, when they have bits in common between their
> > bitmaps.)
>
> Would it not be simpler to have the files/properties depend on the
> schema type? It almost seems as though some of the properties are forced
> to have some meaning for bitmap when they do not seem to be needed. Instead,
> for a bitmap type there can be bitmap specific properties like, for example,
> bit_usage. This may also create more flexibility when there is a future
> mapping function needed that depends on some new property?
>
> Reinette
Sure, there is no reason why the set of properties has to be identical
for different schema types.
It turned out that a single set of properties fitted better than I
expected, so I presented things that way to see what people thought
about it.
For bitmaps, there isn't a strong need to change the set of properties
already available in the top-level info/ directories. These can be
adopted into the new info under resource_schemata/, but I might be
tempted to rename them to remove the "cbm" string so that the names are
applicable to all bitmap-style resources. I might also rename the
min_cbm_bits property if we can think of a more intuitive name -- it's
not obvious how this should apply to sparse bitmaps.
Thinking about bit_usage, is that really per-schema?
If L3CODE and L3DATA are really allocating the same underlying
resource, I wonder whether their bit_usage should be combined,
somehow.
This might be one for later, though.
It doesn't look necessary to adopt all existing properties into the
extended schema description immediately -- if there are some that don't
quite fit, we could adopt them later on without breaking backwards
compatibility.
Do you see a risk, there?
Cheers
---Dave
* Re: [RFC] fs/resctrl: Generic schema description
From: Reinette Chatre @ 2025-11-04 22:26 UTC
To: Dave Martin
Cc: linux-kernel, Tony Luck, James Morse, Chen, Yu C, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86
Hi Dave,
On 10/30/25 9:36 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Tue, Oct 28, 2025 at 04:17:05PM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/24/25 4:12 AM, Dave Martin wrote:
>>> Hi all,
>>>
>>> Going forward, a single resctrl resource (such as memory bandwidth) is
>>> likely to require multiple schemata, either because we want to add new
>>> schemata that provide finer control, or because the hardware has
>>> multiple controls, covering different aspects of resource allocation.
>>>
>>> The fit between MPAM's memory bandwidth controls and the resctrl MB
>>> schema is already awkward, and later Intel RDT features such as Region
>>> Aware Memory Bandwidth Allocation are already pushing past what the MB
>>> schema can describe. Both of these can involve multiple control
>>> values and finer resolution than the 100 steps offered by the current
>>> "MB" schema.
>>>
>>> The previous discussion went off in a few different directions [1], so
>>> I want to focus back onto defining an extended schema description that
>>> aims to cover the use cases that we know about or anticipate today, and
>>> allows for future extension as needed.
>>>
>>> (A separate discussion is needed on how new schemata interact with
>>> previously-defined schemata (such as the MB percentage schema).
>>> I suggest we pause that discussion for now, in the interests of getting
>>> the schema description nailed down.)
>>
>> ok, but let's keep this as "open #1"
>>
>>> Following on from the previous mail thread, I've tried to refine and
>>> flesh out the proposal for schema descriptions a bit, as follows.
>>>
>>> Proposal:
>>>
>>> * Split resource names and schema names in resctrlfs.
>>>
>>> Resources will be named for the unique, existing schema for each
>>> resource.
>>
>> Are you referring to the implementation or how things are exposed to user
>> space? I am trying to understand how the existing L3CODE/L3DATA schemata
>> fit in ... they are presented to user space as two separate resources since
>> they each have their own directory in "info" while internally they are
>> schema of the L3 resource.
>
> Good question -- I didn't take into account here the fact that some
> physical resources already have multiple schemata exposed to userspace.
>
> I've probably overformalised, here. I'm not proposing to refactor the
> arrangement of existing schemata and resources.
>
> So we would continue to have
> info/L3CODE/resource_schemata/L3CODE/ and
> info/L3DATA/resource_schemata/L3DATA/.
>
>
> I think that the decision to combine these under a single resctrl
> resource internally is the most logical one, but I'm proposing just to
> extend the info/ content, without unnecessary changes.
Thank you for confirming. This matches the way I was thinking about this work.
>
> The current arrangement does have one shortcoming, which is that
> software doesn't know (other than by built-in knowledge) that L3CODE
> and L3DATA claim resource from the same hardware pool, so
>
> L3CODE:0=0001
> L3DATA:0=0001
>
> implies that the transactions on the I-side and D-side contend for
> cache lines (unless there are separate L3 I- and D-caches -- but I
> don't think that's a thing on any relevant system...)
>
> So, we might want some way to indicate that L3CODE and L3DATA are
> linked. But I think that CDP is a unique case where we can reasonably
> expect some built-in userspace knowledge.
I'll admit that it is not as obvious as this new interface would make it
for new schemata, but userspace is not entirely left to its own devices.
resctrl will ensure that these resources do not overlap when, for example,
a resource group is exclusive. For example, an L3CODE allocation in one
resource group cannot be created to overlap with an L3DATA allocation in
another when one of the resource groups is exclusive.
>
> I didn't currently plan to address this, but it could come later if we
> think it's important.
>
>> Just trying to understand if you are talking about reverting
>> https://lore.kernel.org/all/20210728170637.25610-1-james.morse@arm.com/ ?
>
> No...
>
>> The current implementation appears to match this proposal so we may need to
>> have special cases to keep CDP backwards compatible.
>>
>> SMBA may also need some extra care ... especially if other architectures start
>> to allocate memory bandwidth to CXL resource via their "MB" resource.
>
> Perhaps. I think it may be necessary to hack up an implementation of
> these changes, to flush out things that don't quite fit.
Have you considered how MPAM may want to deal with different memory "types"?
With SMBA there is a "CXL memory" resource while the MB resource has mostly
been "anything that misses L3". From a user space perspective it is not obvious
to me how users prefer to refer to different memory types.
>
>>
>>> The existing schema will keep its name (the same as the resource
>>> name), and new schemata defined for a resource will include that
>>> name as a prefix (at least, by default).
We may have to be explicit on expectations wrt which schema can be observed in
which area (schemata file vs new info hierarchy). resctrl.rst currently contains:
"schemata":
A list of all the resources available to this group.
With the above in the existing documentation, resctrl may be forced to always keep
existing schemata/resources in the schemata file and to be careful when considering
dropping them, as mused in https://lore.kernel.org/lkml/aPkEb4CkJHZVDt0V@agluck-desk3/
Theoretically, it may become possible in the future for the set of resources that a
resource group may allocate to vary. Consider for example when resources support different
numbers of CLOSID/PARTID and there is a desire to expose that to user space instead of
constraining all resource groups to lowest CLOSID/PARTID. In such a scenario it should
be clear to user space which resources it can allocate to a resource group so it is
reasonable to expect the existing documentation for "schemata" being "A list of all
the resources available to this group." to be respected.
On the flip side, it may not be required that a new schema in the new info hierarchy always
appears in the schemata file. I think this after seeing in MPAM that
controls could be enabled/disabled (like MPAMCFG_MBW_PROP.EN for proportional-stride
partitioning).
resctrl may thus have support for more partitioning controls than what is exposed by
schemata file with ability for user space to choose which partitioning controls to expose
in schemata file to use to manage a resource. It may then turn out that in addition to
(read-only) schema "properties" there may also be (writable) schema "controls" (bad name
since this would "control" a "partitioning control") where user space can modify behavior
of a partitioning control.
>>>
>>> So, for example, we will have an MB resource with a schema called
>>> MB (the schema that we have already). But we may go on to define
>>> additional schemata for the MB resource, with names such as MB_MAX,
>>> etc.
>>>
>>> * Stop adding new schema description information in the top-level
>>> info/<resource>/ directory in resctrlfs.
>>>
>>> For backwards compatibility, we can keep the existing property
>>> files under the resource info directory to describe the previously
>>> defined resource, but we seem to need something richer going
>>> forward.
ack.
>>>
>>> * Add a hierarchy to list all the schemata for each resource, along
>>> with their properties. So far, the proposal looks like this,
>>> taking the MB resource as an example:
>>>
>>>   info/
>>>    └─ MB/
>>>        └─ resource_schemata/
>>>            ├─ MB/
>>>            ├─ MB_MIN/
>>>            ├─ MB_MAX/
>>>            ┆
>>>
>>> Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
>>> In this proposal, these are just dummy schema names for
>>> illustration purposes. The important thing is that they all
>>> control aspects of the "MB" resource, and that there can be more
>>> than one of them.
>>>
>>> It may be appropriate to have a nested hierarchy, where some
>>> schemata are presented as children of other schemata if they
>>> affect the same hardware controls. For now, let's put this issue
>>> on one side, and consider what properties should be advertised for
>>> each schema.
>>
>> ok to put this aside but I think we should keep including it, "open #2" ?
>
> Yes; I'm not abandoning this, but I wanted to focus on the schema
> description, here.
Understood. There may be some connection with this work if there is a hierarchy
since one schema's description may then be in terms of another. For example,
the relationships described via pseudocode in https://lore.kernel.org/lkml/aPJP52jXJvRYAjjV@e133380.arm.com/
As a sidenote (related to the '#' prefix discussion), while trying to understand how
this work may impact user expectations I did come across this in section
"Reading/writing the schemata file" of resctrl.rst:
When writing you only need to specify those values which you wish to change.
This seems quite close to addressing the concern raised in
https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ :
The reason why I think that this convention may be needed is that we
never told (old) userspace what it was supposed to do with schemata
entries that it does not recognise.
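
To make the concern about old userspace concrete, here is a minimal sketch
(not taken from any existing tool) of how a userspace program might pick the
schema name out of one schemata line and decide whether it recognises it.
The helpers schema_name() and schema_is_known() and the list of known names
are assumptions for illustration; only the "NAME:domain=value" line shape
comes from the existing schemata file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the schema name of one schemata line (the text before ':') into
 * "name". Returns false if the line has no ':' or the name does not fit.
 */
static bool schema_name(const char *line, char *name, size_t size)
{
	const char *colon;
	size_t len;

	assert(line && name && size > 0);

	while (*line == ' ' || *line == '\t')	/* tolerate leading padding */
		line++;

	colon = strchr(line, ':');
	if (!colon)
		return false;

	len = (size_t)(colon - line);
	if (len == 0 || len >= size)
		return false;

	memcpy(name, line, len);
	name[len] = '\0';
	return true;
}

/* True if "name" is one of the schema names this tool was written for. */
static bool schema_is_known(const char *name, const char *const *known,
			    size_t nr_known)
{
	for (size_t i = 0; i < nr_known; i++)
		if (strcmp(name, known[i]) == 0)
			return true;
	return false;
}
```

A tool built this way could write back only the lines it recognises, relying
on the documented rule that only the values being changed need to be
specified.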
>>> * Current properties that I think we might want are:
>>>
>>>   info/
>>>    └─ SOME_RESOURCE/
>>>        └─ resource_schemata/
>>>            ├─ SOME_SCHEMA/
>>>            ┆   ├─ type
>>>                ├─ min
>>>                ├─ max
>>>                ├─ tolerance
>>>                ├─ resolution
>>>                ├─ scale
>>>                └─ unit
>>>
>>> (I've tweaked the properties a bit since previous postings.
>>> "type" replaces "map"; "scale" is now the unit multiplier;
>>> "resolution" is now a scaling divisor -- details below.)
>>>
>>> I assume that we expose the properties in individual files, but we
>>> could also combine them into a single description file per schema,
>>> per resource or (possibly) a single global file.
>>> (I don't have a strong view on the best option.)
>>>
>>>
>>> Either way, the following set of properties may be a reasonable
>>> place to start:
>>>
>>>
>>> type: the schema type, followed by optional flag specifiers:
>>>
>>> - "scalar": a single-valued numeric control
>>>
>>> A mandatory flag indicates how the control value written to
>>> the schemata file is converted to an amount of resource for
>>> hardware regulation.
>>>
>>> The flag "linear" indicates a linear mapping.
>>>
>>> In this case, the amount of resource E that is actually
>>> allocated is derived from the control value C written to the
>>> schemata file as follows:
>>>
>>> E = C * scale * unit / resolution
>>>
>>> Other flag values could be defined later, if we encounter
>>> hardware with non-linear controls.
>>>
>>> - "bitmap": a bitmap control
>>>
>>> The optional flag "sparse" is present if the control accepts
>>> sparse bitmaps.
>>>
>>> In this case, E = bitmap_weight(C) * scale * unit / resolution.
>>>
>>> As before, each bit controls access to a specific chunk of
>>> resource in the hardware, such as a group of cache lines. All
>>> chunks are equally sized.
>>>
>>> (Different CTRL_MON groups may still contend within the
>>> allocation E, when they have bits in common between their
>>> bitmaps.)
>>
>> Would it not be simpler to have the files/properties depend on the
>> schema type? It almost seems as though some of the properties are forced
>> to have some meaning for bitmap when they do not seem to be needed. Instead,
>> for a bitmap type there can be bitmap specific properties like, for example,
>> bit_usage. This may also create more flexibility when there is a future
>> mapping function needed that depends on some new property?
>>
>> Reinette
>
> Sure, there is no reason why the set of properties has to be identical
> for different schema types.
>
> It turned out that a single set of properties fitted better than I
> expected, so I presented things that way to see what people thought
> about it.
>
> For bitmaps, there isn't a strong need to change the set of properties
> already available in the top-level info/ directories. These can be
> adopted into the new info under resource_schemata/, but I might be
> tempted to rename them to remove the "cbm" string so that the names are
> applicable to all bitmap-style resources. I might also rename the
> min_cbm_bits property if we can think of a more intuitive name -- it's
> not obvious how this should apply to sparse bitmaps.
yes, this is a good time to rename things.
>
>
> Thinking about bit_usage, is that really per-schema?
Good point. This is per resource.
This may create complexity if multiple controls are available for a resource. For
example, if there is a MB resource with both a proportional schema and a max then
it sounds like it may be possible to program the proportional schema with 100% while
setting the max to 50%. On the hardware side these values may be legal, albeit with
unpredictable performance, but it will be difficult for resctrl to visualize the
"bit_usage" of such an allocation.
>
> If L3CODE and L3DATA are really allocating the same underlying
> resource, I wonder whether their bit_usage should be combined,
> somehow.
Related to earlier comment this is done internally by resctrl but not exposed to
user space. I earlier mentioned how exclusive groups take this into account, there
is also the bitmasks used when creating new resource groups. You will, for example,
find in __init_one_rdt_domain() that their bit usage is combined as below:
	if (resctrl_arch_get_cdp_enabled(r->rid))
		peer_ctl = resctrl_arch_get_config(r, d, i, peer_type);
	else
		peer_ctl = 0;
	ctrl_val = resctrl_arch_get_config(r, d, i, s->conf_type);
	used_b |= ctrl_val | peer_ctl;
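
For readers without the source at hand, the effect of that excerpt can be
modelled outside the kernel. The sketch below is a simplified stand-in, not
the kernel code: struct closid_cfg and used_bits() are hypothetical names,
and the surrounding kernel function does more than is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one CLOSID's configuration in one cache domain. */
struct closid_cfg {
	uint32_t cbm;		/* bitmap of the schema being initialised */
	uint32_t peer_cbm;	/* bitmap of its CDP peer (CODE vs DATA) */
};

/*
 * OR together the bits used by every CLOSID. With CDP enabled the peer
 * schema's bits are folded in as well, because both schemata allocate
 * from the same underlying cache.
 */
static uint32_t used_bits(const struct closid_cfg *cfg, size_t nr_closid,
			  bool cdp_enabled)
{
	uint32_t used_b = 0;

	assert(cfg || nr_closid == 0);

	for (size_t i = 0; i < nr_closid; i++) {
		uint32_t ctrl_val = cfg[i].cbm;
		uint32_t peer_ctl = cdp_enabled ? cfg[i].peer_cbm : 0;

		used_b |= ctrl_val | peer_ctl;
	}
	return used_b;
}
```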
>
> This might be one for later, though.
>
> It doesn't look necessary to adopt all existing properties into the
> extended schema description immediately -- if there are some that don't
> quite fit, we could adopt them later on without breaking backwards
> compatibility.
It is not obvious to me that it will be simple to add a property to an
existing schema type. We may be forced to create new schema type when needing to
do so.
I also think there may be more schema types that will eventually need to be
supported, for example MPAM's priority partitioning?
Reinette
* Re: [RFC] fs/resctrl: Generic schema description
2025-11-04 22:26 ` Reinette Chatre
@ 2025-11-06 17:45 ` Reinette Chatre
0 siblings, 0 replies; 11+ messages in thread
From: Reinette Chatre @ 2025-11-06 17:45 UTC (permalink / raw)
To: Dave Martin
Cc: linux-kernel, Tony Luck, James Morse, Chen, Yu C, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, Drew Fustini
+Drew
On 11/4/25 2:26 PM, Reinette Chatre wrote:
> Hi Dave,
>
> On 10/30/25 9:36 AM, Dave Martin wrote:
>> Hi Reinette,
>>
>> On Tue, Oct 28, 2025 at 04:17:05PM -0700, Reinette Chatre wrote:
>>> Hi Dave,
>>>
>>> On 10/24/25 4:12 AM, Dave Martin wrote:
>>>> Hi all,
>>>>
>>>> Going forward, a single resctrl resource (such as memory bandwidth) is
>>>> likely to require multiple schemata, either because we want to add new
>>>> schemata that provide finer control, or because the hardware has
>>>> multiple controls, covering different aspects of resource allocation.
>>>>
>>>> The fit between MPAM's memory bandwidth controls and the resctrl MB
>>>> schema is already awkward, and later Intel RDT features such as Region
>>>> Aware Memory Bandwidth Allocation are already pushing past what the MB
>>>> schema can describe. Both of these can involve multiple control
>>>> values and finer resolution than the 100 steps offered by the current
>>>> "MB" schema.
>>>>
>>>> The previous discussion went off in a few different directions [1], so
>>>> I want to focus back onto defining an extended schema description that
>>>> aims to cover the use cases that we know about or anticipate today, and
>>>> allows for future extension as needed.
>>>>
>>>> (A separate discussion is needed on how new schemata interact with
>>>> previously-defined schemata (such as the MB percentage schema).
>>>> I suggest we pause that discussion for now, in the interests of getting
>>>> the schema description nailed down.)
>>>
>>> ok, but let's keep this as "open #1"
>>>
>>>> Following on from the previous mail thread, I've tried to refine and
>>>> flesh out the proposal for schema descriptions a bit, as follows.
>>>>
>>>> Proposal:
>>>>
>>>> * Split resource names and schema names in resctrlfs.
>>>>
>>>> Resources will be named for the unique, existing schema for each
>>>> resource.
>>>
>>> Are you referring to the implementation or how things are exposed to user
>>> space? I am trying to understand how the existing L3CODE/L3DATA schemata
>>> fit in ... they are presented to user space as two separate resources since
>>> they each have their own directory in "info" while internally they are
>>> schema of the L3 resource.
>>
>> Good question -- I didn't take into account here the fact that some
>> physical resources already have multiple schemata exposed to userspace.
>>
>> I've probably overformalised, here. I'm not proposing to refactor the
>> arrangement of existing schemata and resources.
>>
>> So we would continue to have
>> info/L3CODE/resource_schemata/L3CODE/ and
>> info/L3DATA/resource_schemata/L3DATA/.
>>
>>
>> I think that the decision to combine these under a single resctrl
>> resource internally is the most logical one, but I'm proposing just to
>> extend the info/ content, without unnecessary changes.
>
> Thank you for confirming. This matches the way I was thinking about this work.
>
>>
>> The current arrangement does have one shortcoming, which is that
>> software doesn't know (other than by built-in knowledge) that L3CODE
>> and L3DATA claim resource from the same hardware pool, so
>>
>> L3CODE:0=0001
>> L3DATA:0=0001
>>
>> implies that the transactions on the I-side and D-side contend for
>> cache lines (unless there are separate L3 I- and D-caches -- but I
>> don't think that's a thing on any relevant system...)
>>
>> So, we might want some way to indicate that L3CODE and L3DATA are
>> linked. But I think that CDP is a unique case where we can reasonably
>> expect some built-in userspace knowledge.
>
> I'll admit that it is not as obvious as this new interface would make it be
> for new schemata but userspace is not entirely left to its own devices.
> resctrl will ensure that these resources do not overlap when, for example,
> a resource group is exclusive. For example, an L3CODE allocation in one
> resource group cannot be created to overlap with an L3DATA allocation in
> another when one of the resource groups is exclusive.
>
>>
>> I didn't currently plan to address this, but it could come later if we
>> think it's important.
>>
>>> Just trying to understand if you are talking about reverting
>>> https://lore.kernel.org/all/20210728170637.25610-1-james.morse@arm.com/ ?
>>
>> No...
>>
>>> The current implementation appears to match this proposal so we may need to
>>> have special cases to keep CDP backwards compatible.
>>>
>>> SMBA may also need some extra care ... especially if other architectures start
>>> to allocate memory bandwidth to CXL resource via their "MB" resource.
>>
>> Perhaps. I think it may be necessary to hack up an implementation of
>> these changes, to flush out things that don't quite fit.
>
> Have you considered how MPAM may want to deal with different memory "types"?
> With SMBA there is a "CXL memory" resource while the MB resource has mostly
> been "anything that misses L3". From a user space perspective it is not obvious
> to me how users prefer to refer to different memory types.
>
>>
>>>
>>>> The existing schema will keep its name (the same as the resource
>>>> name), and new schemata defined for a resource will include that
>>>> name as a prefix (at least, by default).
>
> We may have to be explicit on expectations wrt which schema can be observed in
> which area (schemata file vs new info hierarchy). resctrl.rst currently contains:
> "schemata":
> A list of all the resources available to this group.
> With the above in existing documentation resctrl may be forced to always keep
> existing schema/resource in the schemata file and be careful when considering whether to
> drop them as mused in https://lore.kernel.org/lkml/aPkEb4CkJHZVDt0V@agluck-desk3/
>
> Theoretically it may be possible in the future for it to vary which resources a
> resource group may allocate. Consider for example when resources support different
> numbers of CLOSID/PARTID and there is a desire to expose that to user space instead of
> constraining all resource groups to lowest CLOSID/PARTID. In such a scenario it should
> be clear to user space which resources it can allocate to a resource group so it is
> reasonable to expect the existing documentation for "schemata" being "A list of all
> the resources available to this group." to be respected.
>
> On the flip side, it may not be required that a new schema in the new info hierarchy always
> appears in the schemata file. I think this after seeing in MPAM that
> controls could be enabled/disabled (like MPAMCFG_MBW_PROP.EN for proportional-stride
> partitioning).
>
> resctrl may thus have support for more partitioning controls than what is exposed by
> schemata file with ability for user space to choose which partitioning controls to expose
> in schemata file to use to manage a resource. It may then turn out that in addition to
> (read-only) schema "properties" there may also be (writable) schema "controls" (bad name
> since this would "control" a "partitioning control") where user space can modify behavior
> of a partitioning control.
>
>>>>
>>>> So, for example, we will have an MB resource with a schema called
>>>> MB (the schema that we have already). But we may go on to define
>>>> additional schemata for the MB resource, with names such as MB_MAX,
>>>> etc.
>>>>
>>>> * Stop adding new schema description information in the top-level
>>>> info/<resource>/ directory in resctrlfs.
>>>>
>>>> For backwards compatibility, we can keep the existing property
>>>> files under the resource info directory to describe the previously
>>>> defined resource, but we seem to need something richer going
>>>> forward.
>
> ack.
>
>>>>
>>>> * Add a hierarchy to list all the schemata for each resource, along
>>>> with their properties. So far, the proposal looks like this,
>>>> taking the MB resource as an example:
>>>>
>>>>   info/
>>>>    └─ MB/
>>>>        └─ resource_schemata/
>>>>            ├─ MB/
>>>>            ├─ MB_MIN/
>>>>            ├─ MB_MAX/
>>>>            ┆
>>>>
>>>> Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
>>>> In this proposal, these are just dummy schema names for
>>>> illustration purposes. The important thing is that they all
>>>> control aspects of the "MB" resource, and that there can be more
>>>> than one of them.
>>>>
>>>> It may be appropriate to have a nested hierarchy, where some
>>>> schemata are presented as children of other schemata if they
>>>> affect the same hardware controls. For now, let's put this issue
>>>> on one side, and consider what properties should be advertised for
>>>> each schema.
>>>
>>> ok to put this aside but I think we should keep including it, "open #2" ?
>>
>> Yes; I'm not abandoning this, but I wanted to focus on the schema
>> description, here.
>
> Understood. There may be some connection with this work if there is a hierarchy
> since one schema's description may then be in terms of another. For example,
> the relationships described via pseudocode in https://lore.kernel.org/lkml/aPJP52jXJvRYAjjV@e133380.arm.com/
>
> As a sidenote (related to the '#' prefix discussion), while trying to understand how
> this work may impact user expectations I did come across this in section
> "Reading/writing the schemata file" of resctrl.rst:
> When writing you only need to specify those values which you wish to change.
>
> This seems quite close to addressing the concern raised in
> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ :
> The reason why I think that this convention may be needed is that we
> never told (old) userspace what it was supposed to do with schemata
> entries that it does not recognise.
>
>>>> * Current properties that I think we might want are:
>>>>
>>>>   info/
>>>>    └─ SOME_RESOURCE/
>>>>        └─ resource_schemata/
>>>>            ├─ SOME_SCHEMA/
>>>>            ┆   ├─ type
>>>>                ├─ min
>>>>                ├─ max
>>>>                ├─ tolerance
>>>>                ├─ resolution
>>>>                ├─ scale
>>>>                └─ unit
>>>>
>>>> (I've tweaked the properties a bit since previous postings.
>>>> "type" replaces "map"; "scale" is now the unit multiplier;
>>>> "resolution" is now a scaling divisor -- details below.)
>>>>
>>>> I assume that we expose the properties in individual files, but we
>>>> could also combine them into a single description file per schema,
>>>> per resource or (possibly) a single global file.
>>>> (I don't have a strong view on the best option.)
>>>>
>>>>
>>>> Either way, the following set of properties may be a reasonable
>>>> place to start:
>>>>
>>>>
>>>> type: the schema type, followed by optional flag specifiers:
>>>>
>>>> - "scalar": a single-valued numeric control
>>>>
>>>> A mandatory flag indicates how the control value written to
>>>> the schemata file is converted to an amount of resource for
>>>> hardware regulation.
>>>>
>>>> The flag "linear" indicates a linear mapping.
>>>>
>>>> In this case, the amount of resource E that is actually
>>>> allocated is derived from the control value C written to the
>>>> schemata file as follows:
>>>>
>>>> E = C * scale * unit / resolution
>>>>
>>>> Other flag values could be defined later, if we encounter
>>>> hardware with non-linear controls.
>>>>
>>>> - "bitmap": a bitmap control
>>>>
>>>> The optional flag "sparse" is present if the control accepts
>>>> sparse bitmaps.
>>>>
>>>> In this case, E = bitmap_weight(C) * scale * unit / resolution.
>>>>
>>>> As before, each bit controls access to a specific chunk of
>>>> resource in the hardware, such as a group of cache lines. All
>>>> chunks are equally sized.
>>>>
>>>> (Different CTRL_MON groups may still contend within the
>>>> allocation E, when they have bits in common between their
>>>> bitmaps.)
>>>
>>> Would it not be simpler to have the files/properties depend on the
>>> schema type? It almost seems as though some of the properties are forced
>>> to have some meaning for bitmap when they do not seem to be needed. Instead,
>>> for a bitmap type there can be bitmap specific properties like, for example,
>>> bit_usage. This may also create more flexibility when there is a future
>>> mapping function needed that depends on some new property?
>>>
>>> Reinette
>>
>> Sure, there is no reason why the set of properties has to be identical
>> for different schema types.
>>
>> It turned out that a single set of properties fitted better than I
>> expected, so I presented things that way to see what people thought
>> about it.
>>
>> For bitmaps, there isn't a strong need to change the set of properties
>> already available in the top-level info/ directories. These can be
>> adopted into the new info under resource_schemata/, but I might be
>> tempted to rename them to remove the "cbm" string so that the names are
>> applicable to all bitmap-style resources. I might also rename the
>> min_cbm_bits property if we can think of a more intuitive name -- it's
>> not obvious how this should apply to sparse bitmaps.
>
> yes, this is a good time to rename things.
>
>>
>>
>> Thinking about bit_usage, is that really per-schema?
>
> Good point. This is per resource.
>
> This may create complexity if multiple controls are available for a resource. For
> example, if there is a MB resource with both a proportional schema and a max then
> it sounds like it may be possible to program the proportional schema with 100% while
> setting the max to 50%. On the hardware side these values may be legal, albeit with
> unpredictable performance, but it will be difficult for resctrl to visualize the
> "bit_usage" of such an allocation.
>
>>
>> If L3CODE and L3DATA are really allocating the same underlying
>> resource, I wonder whether their bit_usage should be combined,
>> somehow.
>
> Related to earlier comment this is done internally by resctrl but not exposed to
> user space. I earlier mentioned how exclusive groups take this into account, there
> is also the bitmasks used when creating new resource groups. You will, for example,
> find in __init_one_rdt_domain() that their bit usage is combined as below:
>
> 	if (resctrl_arch_get_cdp_enabled(r->rid))
> 		peer_ctl = resctrl_arch_get_config(r, d, i, peer_type);
> 	else
> 		peer_ctl = 0;
> 	ctrl_val = resctrl_arch_get_config(r, d, i, s->conf_type);
> 	used_b |= ctrl_val | peer_ctl;
>
>>
>> This might be one for later, though.
>>
>> It doesn't look necessary to adopt all existing properties into the
>> extended schema description immediately -- if there are some that don't
>> quite fit, we could adopt them later on without breaking backwards
>> compatibility.
>
> It is not obvious to me that it will be simple to add a property to an
> existing schema type. We may be forced to create new schema type when needing to
> do so.
>
> I also think there may be more schema types that will eventually need to be
> supported, for example MPAM's priority partitioning?
>
> Reinette
* Re: [RFC] fs/resctrl: Generic schema description
2025-10-24 11:12 [RFC] fs/resctrl: Generic schema description Dave Martin
2025-10-28 23:17 ` Reinette Chatre
@ 2025-11-10 12:37 ` Ben Horgan
2025-12-16 22:26 ` Reinette Chatre
2 siblings, 0 replies; 11+ messages in thread
From: Ben Horgan @ 2025-11-10 12:37 UTC (permalink / raw)
To: Dave Martin, linux-kernel
Cc: Tony Luck, Reinette Chatre, James Morse, Chen, Yu C,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, dfustini@baylibre.com
Hi Dave,
+Drew (Adding as Reinette added you for the other fork of the
discussion. This fork is here as the tolerance section was snipped from
the other replies.)
On 10/24/25 12:12, Dave Martin wrote:
> Hi all,
>
> Going forward, a single resctrl resource (such as memory bandwidth) is
> likely to require multiple schemata, either because we want to add new
> schemata that provide finer control, or because the hardware has
> multiple controls, covering different aspects of resource allocation.
>
> The fit between MPAM's memory bandwidth controls and the resctrl MB
> schema is already awkward, and later Intel RDT features such as Region
> Aware Memory Bandwidth Allocation are already pushing past what the MB
> schema can describe. Both of these can involve multiple control
> values and finer resolution than the 100 steps offered by the current
> "MB" schema.
>
> The previous discussion went off in a few different directions [1], so
> I want to focus back onto defining an extended schema description that
> aims to cover the use cases that we know about or anticipate today, and
> allows for future extension as needed.
>
> (A separate discussion is needed on how new schemata interact with
> previously-defined schemata (such as the MB percentage schema).
> I suggest we pause that discussion for now, in the interests of getting
> the schema description nailed down.)
>
>
> Following on from the previous mail thread, I've tried to refine and
> flesh out the proposal for schema descriptions a bit, as follows.
>
> Proposal:
>
> * Split resource names and schema names in resctrlfs.
>
> Resources will be named for the unique, existing schema for each
> resource.
>
> The existing schema will keep its name (the same as the resource
> name), and new schemata defined for a resource will include that
> name as a prefix (at least, by default).
>
> So, for example, we will have an MB resource with a schema called
> MB (the schema that we have already). But we may go on to define
> additional schemata for the MB resource, with names such as MB_MAX,
> etc.
>
> * Stop adding new schema description information in the top-level
> info/<resource>/ directory in resctrlfs.
>
> For backwards compatibility, we can keep the existing property
> files under the resource info directory to describe the previously
> defined resource, but we seem to need something richer going
> forward.
>
> * Add a hierarchy to list all the schemata for each resource, along
> with their properties. So far, the proposal looks like this,
> taking the MB resource as an example:
>
>   info/
>    └─ MB/
>        └─ resource_schemata/
>            ├─ MB/
>            ├─ MB_MIN/
>            ├─ MB_MAX/
>            ┆
>
> Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> In this proposal, these are just dummy schema names for
> illustration purposes. The important thing is that they all
> control aspects of the "MB" resource, and that there can be more
> than one of them.
>
> It may be appropriate to have a nested hierarchy, where some
> schemata are presented as children of other schemata if they
> affect the same hardware controls. For now, let's put this issue
> on one side, and consider what properties should be advertised for
> each schema.
>
> * Current properties that I think we might want are:
>
>   info/
>    └─ SOME_RESOURCE/
>        └─ resource_schemata/
>            ├─ SOME_SCHEMA/
>            ┆   ├─ type
>                ├─ min
>                ├─ max
>                ├─ tolerance
>                ├─ resolution
>                ├─ scale
>                └─ unit
>
> (I've tweaked the properties a bit since previous postings.
> "type" replaces "map"; "scale" is now the unit multiplier;
> "resolution" is now a scaling divisor -- details below.)
>
> I assume that we expose the properties in individual files, but we
> could also combine them into a single description file per schema,
> per resource or (possibly) a single global file.
> (I don't have a strong view on the best option.)
>
>
> Either way, the following set of properties may be a reasonable
> place to start:
>
>
> type: the schema type, followed by optional flag specifiers:
>
> - "scalar": a single-valued numeric control
>
> A mandatory flag indicates how the control value written to
> the schemata file is converted to an amount of resource for
> hardware regulation.
>
> The flag "linear" indicates a linear mapping.
>
> In this case, the amount of resource E that is actually
> allocated is derived from the control value C written to the
> schemata file as follows:
>
> E = C * scale * unit / resolution
>
> Other flag values could be defined later, if we encounter
> hardware with non-linear controls.
>
> - "bitmap": a bitmap control
>
> The optional flag "sparse" is present if the control accepts
> sparse bitmaps.
>
> In this case, E = bitmap_weight(C) * scale * unit / resolution.
>
> As before, each bit controls access to a specific chunk of
> resource in the hardware, such as a group of cache lines. All
> chunks are equally sized.
>
> (Different CTRL_MON groups may still contend within the
> allocation E, when they have bits in common between their
> bitmaps.)
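
The two mappings above can be sketched from the userspace side as follows.
This is only an illustration of the proposed formulas: struct schema_desc and
the function names are hypothetical, and the result is expressed in multiples
of "unit" so that the unit string itself does not need to be interpreted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace view of two of the proposed per-schema properties. */
struct schema_desc {
	uint64_t scale;		/* multiplier applied to the unit */
	uint64_t resolution;	/* scaling divisor */
};

/* Number of set bits in a bitmap control value (portable popcount). */
static unsigned int bitmap_weight64(uint64_t bitmap)
{
	unsigned int weight = 0;

	while (bitmap) {
		bitmap &= bitmap - 1;	/* clear the lowest set bit */
		weight++;
	}
	return weight;
}

/* "scalar linear": E = C * scale / resolution, in multiples of the unit. */
static double scalar_linear_amount(const struct schema_desc *s, uint64_t c)
{
	assert(s && s->resolution != 0);
	return (double)c * (double)s->scale / (double)s->resolution;
}

/* "bitmap": E = bitmap_weight(C) * scale / resolution, in multiples of the unit. */
static double bitmap_amount(const struct schema_desc *s, uint64_t c)
{
	return scalar_linear_amount(s, bitmap_weight64(c));
}
```

With {scale=1, resolution=100}, a control value of 50 gives 0.5 units. With a
16-bit bitmap schema described as {scale=1, resolution=16}, the bitmap 0x00f0
has weight 4 and gives 4/16 = 0.25 units.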
The types "linear", "bitmap" and "scalar" and the way you've described
them make sense to me.
Do we also want to consider something like "weighted"?
In MPAM there is memory-bandwidth proportional-stride partitioning which,
when the bandwidth is saturated, allocates bandwidth based on an inverse
weight, the stride.
From: https://developer.arm.com/documentation/ihi0099/aa/ A.4.1
"""
In this model, each partition has an offset[p] that tracks the time
since the partition, p, consumed bandwidth but is bounded to be less
than offset_limit. When a request, r, arrives it is given a deadline, of
the current_time plus stride(p) minus offset(p). The offset(p) is set
to current_time – deadline, and the offset(p) is incremented in
event-time units until it reaches the offset_limit.
"""
>
> min:
>
> - For a scalar schema, the minimum value that can be written to
> the control when writing the schemata file.
>
> - For a bitmap schema, a bitmap of the minimum weight that the
> schema accepts: if an empty bitmap is accepted, this can be 0.
> Otherwise, if bitmaps with a single bit set are acceptable,
> this can just have the lowest-order bit set.
>
> Most commonly, the value will probably be "1".
>
> For bitmap schemata, we might report this in hex. In the
> interest of generic parsing, we could include a "0x" prefix if
> so.
>
> max:
>
> - For a scalar schema, the maximum value that can be written to
> the control when writing the schemata file.
>
> - For a bitmap schema, the mask with all bits set.
>
> Possibly reported in hex for bitmap schemata (as for "min").
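
Taken together, "min", "max" and the "sparse" flag would let userspace
pre-validate a bitmap before writing it. A minimal sketch, assuming the
semantics described above (min is a bitmap whose weight is the minimum
accepted weight, max is the mask with all bits set). The function names are
hypothetical, and the kernel remains the final arbiter of what it accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of set bits in v. */
static unsigned int weight64(uint64_t v)
{
	unsigned int weight = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		weight++;
	}
	return weight;
}

/* True if the set bits of a non-zero value form one contiguous run. */
static bool is_contiguous(uint64_t v)
{
	uint64_t filled = v | (v - 1);	/* fill zeros below the lowest set bit */

	return (filled & (filled + 1)) == 0;	/* filled must be 2^n - 1 */
}

/* Check a candidate bitmap against the advertised min, max and sparse flag. */
static bool bitmap_value_ok(uint64_t value, uint64_t min, uint64_t max,
			    bool sparse)
{
	assert(weight64(min) <= weight64(max));	/* sanity-check the description */

	if (value & ~max)			/* bits outside the mask */
		return false;
	if (weight64(value) < weight64(min))	/* too few bits set */
		return false;
	if (!sparse && value != 0 && !is_contiguous(value))
		return false;
	return true;
}
```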
>
> tolerance:
>
> (See below for discussion on this.)
>
> - "0": the control is exact
>
> - "1": the effective control value is within ±1 of the control
> value written to the schemata file. (Similarly, positive "n" ->
> ±n.)
>
> A negative value could be used to indicate that the tolerance
> is unknown. (Possibly we could also just omit the property,
> though it seems better to warn userspace explicitly if we
> don't know.)
>
> Tests might make use of this parameter in order to determine
> how picky to be about exact measurement results.
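
As an illustration of how a test might consume this property, here is a small
sketch. It assumes the tolerance is expressed in control-value steps, as
described above, and that a negative value means "unknown". The helper names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A negative tolerance is the proposed way to say "unknown". */
static bool tolerance_is_known(int64_t tolerance)
{
	return tolerance >= 0;
}

/*
 * True if "effective" lies within +/- tolerance of "written".
 * Callers are expected to check tolerance_is_known() first and to skip
 * the comparison (rather than fail) when the tolerance is unknown.
 */
static bool within_tolerance(uint64_t written, uint64_t effective,
			     int64_t tolerance)
{
	uint64_t diff;

	assert(tolerance_is_known(tolerance));

	diff = written > effective ? written - effective : effective - written;
	return diff <= (uint64_t)tolerance;
}
```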
>
> resolution:
>
> - For a proportional scalar schema: the number of divisions that
> the whole resource is divided into. (See below for
> "proportional scalar schema".)
>
> Typically, this will be the same as the "max" value.
>
> - For an absolute scalar schema: the divisor applied to the
> control value.
>
> - For a bitmap schema: the size of the bitmap in bits.
>
> scale:
>
> - For a scalar schema: the scale-up multiplier applied to
> "unit".
>
> - For a bitmap schema: probably "1".
>
> unit:
>
> - The base unit of the quantity measured by the control value.
>
> The special unit "all" denotes a proportional schema. In this
> case, the resource is a finite, physical thing such as a cache
> or maxed-out data throughput of a memory controller. The
> entire physical resource is available for allocation, and the
> control value indicates what proportion of it is allocated.
>
> Bitmap schemata will probably all be proportional and use the
> unit "all". (This applies to cache bitmaps, at least.)
>
> Absolute schemata will require specification of the base unit
> here, say, "MBps". The "scale" parameter can be used to avoid
> proliferation of unit strings:
>
> For example, {scale=1000, unit="MBps"} would be equivalent to
> {scale=1, unit="GBps"}.
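
The equivalence in that example can be checked mechanically. The sketch below
assumes a small, purely illustrative unit table; the proposal does not fix a
list of unit strings, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative unit table only: size of one unit in MBps, 0 if unknown. */
static uint64_t unit_in_mbps(const char *unit)
{
	assert(unit);

	if (strcmp(unit, "MBps") == 0)
		return 1;
	if (strcmp(unit, "GBps") == 0)
		return 1000;
	return 0;
}

/* scale * unit expressed in MBps, or 0 if the unit is not in the table. */
static uint64_t scaled_unit_in_mbps(uint64_t scale, const char *unit)
{
	return scale * unit_in_mbps(unit);
}
```

Both {scale=1000, unit="MBps"} and {scale=1, unit="GBps"} normalise to 1000
MBps here, which is why a large "scale" can stand in for a larger unit string.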
>
>
> Note on the "tolerance" parameter:
>
> This is a new addition. On the MPAM side, the hardware has a choice
> about how to interpret the control value in some edge-case situations.
> We may not reasonably be able to probe for this, so it may be useful
> to warn software that there is an uncertainty margin.
>
> We might also be able to use the "tolerance" parameter to accommodate
> the rounding behaviour of the existing "MB" schema (otherwise, we
> might want a special "type" for this schema, if it doesn't comply
> closely enough).
Is "tolerance" referring to the number of bits in the hardware interface
or actually how good the hardware is at partitioning the bandwidth?
Say, if there are 16 bits of control in the hardware interface but the
hardware only actually considers the 4 most significant bits, does anything
different get set? Can we even know what the h/w does?
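
To make the question concrete, here is a minimal sketch of that 16-bit/4-bit case. It assumes the hardware simply truncates the low-order bits, which is only one possible behaviour; the function names are invented for illustration and do not come from any driver.

```python
def effective_control(value: int, total_bits: int = 16, used_bits: int = 4) -> int:
    """Value the hardware would act on if it ignored the low-order bits.

    Assumption: the ignored bits are truncated (treated as zero), not rounded.
    """
    ignored = total_bits - used_bits
    return (value >> ignored) << ignored


def worst_case_error(total_bits: int = 16, used_bits: int = 4) -> int:
    """Largest gap between a written value and the value acted on."""
    return (1 << (total_bits - used_bits)) - 1


if __name__ == "__main__":
    written = 0x1234
    print(hex(written), "->", hex(effective_control(written)))
    print("worst-case error in control-value units:", worst_case_error())
```

Under that truncation assumption, a "tolerance" expressed in control-value units would have to cover the worst-case error, which only helps if the kernel can actually discover how many bits are honoured.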
>
>
> If we want to deploy resctrl under virtualisation, resctrl on the host
> could dynamically affect the actual amount of resource that is
> available for allocation inside a VM.
For virtualization I think "tolerance" is somewhat tricky.
1. The guest may be given a non-power-of-2 fraction, e.g. a third, of the
resource and so the tolerance will not strictly map to bits. This is similar
to the percentage to MB though.
2. The VM doesn't necessarily know the hardware it is running on. If a VM
is migrated from one machine to another it could end up with more
bandwidth available but fewer bits to control that bandwidth.
How should "tolerance" be determined? Based on the actual hardware
interface or just on the emulated guest hardware interface?
>
> Whether or not we ever want to do that, it might be useful to have a
> way to warn software that the effective control values hitting the
> hardware may not be entirely predictable.
>
> Thoughts?
>
> Cheers
> ---Dave
>
>
> [1] Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
> https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
>
Thanks,
Ben
* Re: [RFC] fs/resctrl: Generic schema description
2025-10-24 11:12 [RFC] fs/resctrl: Generic schema description Dave Martin
2025-10-28 23:17 ` Reinette Chatre
2025-11-10 12:37 ` Ben Horgan
@ 2025-12-16 22:26 ` Reinette Chatre
2025-12-26 10:38 ` Chen, Yu C
2026-01-24 18:09 ` Drew Fustini
2 siblings, 2 replies; 11+ messages in thread
From: Reinette Chatre @ 2025-12-16 22:26 UTC (permalink / raw)
To: Dave Martin, linux-kernel, Babu Moger, Fenghua Yu, fustini
Cc: Tony Luck, James Morse, Chen, Yu C, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86
Hi Babu and Fenghua,
Could you please consider how the new AMD and MPAM features [2] may benefit
from the new interfaces proposed here? More below ...
On 10/24/25 4:12 AM, Dave Martin wrote:
> Hi all,
>
> Going forward, a single resctrl resource (such as memory bandwidth) is
> likely to require multiple schemata, either because we want to add new
> schemata that provide finer control, or because the hardware has
> multiple controls, covering different aspects of resource allocation.
>
> The fit between MPAM's memory bandwidth controls and the resctrl MB
> schema is already awkward, and later Intel RDT features such as Region
> Aware Memory Bandwidth Allocation are already pushing past what the MB
> schema can describe. Both of these can involve multiple control
> values and finer resolution than the 100 steps offered by the current
> "MB" schema.
>
> The previous discussion went off in a few different directions [1], so
> I want to focus back onto defining an extended schema description that
> aims to cover the use cases that we know about or anticipate today, and
> allows for future extension as needed.
>
> (A separate discussion is needed on how new schemata interact with
> previously-defined schemata (such as the MB percentage schema).
> I suggest we pause that discussion for now, in the interests of getting
> the schema description nailed down.)
>
>
> Following on from the previous mail thread, I've tried to refine and
> flesh out the proposal for schema descriptions a bit, as follows.
>
> Proposal:
>
> * Split resource names and schema names in resctrlfs.
>
> Resources will be named for the unique, existing schema for each
> resource.
>
> The existing schema will keep its name (the same as the resource
> name), and new schemata defined for a resource will include that
> name as a prefix (at least, by default).
>
> So, for example, we will have an MB resource with a schema called
> MB (the schema that we have already). But we may go on to define
> additional schemata for the MB resource, with names such as MB_MAX,
> etc.
>
> * Stop adding new schema description information in the top-level
> info/<resource>/ directory in resctrlfs.
>
> For backwards compatibility, we can keep the existing property
> files under the resource info directory to describe the previously
> defined resource, but we seem to need something richer going
> forward.
>
> * Add a hierarchy to list all the schemata for each resource, along
> with their properties. So far, the proposal looks like this,
> taking the MB resource as an example:
>
> info/
>  └─ MB/
>      └─ resource_schemata/
>          ├─ MB/
>          ├─ MB_MIN/
>          ├─ MB_MAX/
>          ┆
>
> Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> In this proposal, these are just dummy schema names for
> illustration purposes. The important thing is that they all
> control aspects of the "MB" resource, and that there can be more
> than one of them.
>
> It may be appropriate to have a nested hierarchy, where some
> schemata are presented as children of other schemata if they
> affect the same hardware controls. For now, let's put this issue
> on one side, and consider what properties should be advertised for
> each schema.
>
> * Current properties that I think we might want are:
>
> info/
>  └─ SOME_RESOURCE/
>      └─ resource_schemata/
>          ├─ SOME_SCHEMA/
>          ┆   ├─ type
>              ├─ min
>              ├─ max
>              ├─ tolerance
>              ├─ resolution
>              ├─ scale
>              └─ unit
>
> (I've tweaked the properties a bit since previous postings.
> "type" replaces "map"; "scale" is now the unit multiplier;
> "resolution" is now a scaling divisor -- details below.)
>
> I assume that we expose the properties in individual files, but we
> could also combine them into a single description file per schema,
> per resource or (possibly) a single global file.
> (I don't have a strong view on the best option.)
>
>
> Either way, the following set of properties may be a reasonable
> place to start:
>
>
> type: the schema type, followed by optional flag specifiers:
>
> - "scalar": a single-valued numeric control
>
> A mandatory flag indicates how the control value written to
> the schemata file is converted to an amount of resource for
> hardware regulation.
>
> The flag "linear" indicates a linear mapping.
>
> In this case, the amount of resource E that is actually
> allocated is derived from the control value C written to the
> schemata file as follows:
>
> E = C * scale * unit / resolution
>
> Other flag values could be defined later, if we encounter
> hardware with non-linear controls.
>
> - "bitmap": a bitmap control
>
> The optional flag "sparse" is present if the control accepts
> sparse bitmaps.
>
> In this case, E = bitmap_weight(C) * scale * unit / resolution.
>
> As before, each bit controls access to a specific chunk of
> resource in the hardware, such as a group of cache lines. All
> chunks are equally sized.
>
> (Different CTRL_MON groups may still contend within the
> allocation E, when they have bits in common between their
> bitmaps.)
>
> min:
>
> - For a scalar schema, the minimum value that can be written to
> the control when writing the schemata file.
>
> - For a bitmap schema, a bitmap of the minimum weight that the
> schema accepts: if an empty bitmap is accepted, this can be 0.
> Otherwise, if bitmaps with a single bit set are acceptable,
> this can just have the lowest-order bit set.
>
> Most commonly, the value will probably be "1".
>
> For bitmap schemata, we might report this in hex. In the
> interest of generic parsing, we could include a "0x" prefix if
> so.
>
> max:
>
> - For a scalar schema, the maximum value that can be written to
> the control when writing the schemata file.
>
> - For a bitmap schema, the mask with all bits set.
>
> Possibly reported in hex for bitmap schemata (as for "min").
>
> tolerance:
>
> (See below for discussion on this.)
>
> - "0": the control is exact
>
> - "1": the effective control value is within ±1 of the control
> value written to the schemata file. (Similarly, positive "n" ->
> ±n.)
>
> A negative value could be used to indicate that the tolerance
> is unknown. (Possibly we could also just omit the property,
> though it seems better to warn userspace explicitly if we
> don't know.)
>
> Tests might make use of this parameter in order to determine
> how picky to be about exact measurement results.
>
> resolution:
>
> - For a proportional scalar schema: the number of divisions that
> the whole resource is divided into. (See below for
> "proportional scalar schema".)
>
> Typically, this will be the same as the "max" value.
>
> - For an absolute scalar schema: the divisor applied to the
> control value.
>
> - For a bitmap schema: the size of the bitmap in bits.
>
> scale:
>
> - For a scalar schema: the scale-up multiplier applied to
> "unit".
>
> - For a bitmap schema: probably "1".
>
> unit:
>
> - The base unit of the quantity measured by the control value.
>
> The special unit "all" denotes a proportional schema. In this
> case, the resource is a finite, physical thing such as a cache
> or maxed-out data throughput of a memory controller. The
> entire physical resource is available for allocation, and the
> control value indicates what proportion of it is allocated.
>
> Bitmap schemata will probably all be proportional and use the
> unit "all". (This applies to cache bitmaps, at least.)
>
> Absolute schemata will require specification of the base unit
> here, say, "MBps". The "scale" parameter can be used to avoid
> proliferation of unit strings:
>
> For example, {scale=1000, unit="MBps"} would be equivalent to
> {scale=1, unit="GBps"}.
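
As a rough illustration of the linear mapping and of the scale/unit equivalence quoted above, the arithmetic can be sketched as follows. This is only a sketch: the function names and the MBPS_PER_UNIT table are made up here, and the assumption that 1 GBps equals 1000 MBps is taken from the quoted example rather than from any defined unit list.

```python
from fractions import Fraction


def scalar_amount(control: int, scale: int, resolution: int) -> Fraction:
    """E expressed in multiples of "unit" for a linear scalar schema:
    E / unit = C * scale / resolution."""
    return Fraction(control * scale, resolution)


def bitmap_amount(bitmap: int, scale: int, resolution: int) -> Fraction:
    """E expressed in multiples of "unit" for a bitmap schema:
    E / unit = bitmap_weight(C) * scale / resolution."""
    weight = bin(bitmap).count("1")
    return Fraction(weight * scale, resolution)


# Hypothetical conversion table, used only to compare the two descriptions.
MBPS_PER_UNIT = {"MBps": 1, "GBps": 1000}

if __name__ == "__main__":
    # Proportional scalar schema (unit "all"): 50 of 100 divisions.
    print(scalar_amount(50, scale=1, resolution=100))      # fraction of "all"

    # Absolute scalar schema described two ways.
    a = scalar_amount(5, scale=1000, resolution=1) * MBPS_PER_UNIT["MBps"]
    b = scalar_amount(5, scale=1, resolution=1) * MBPS_PER_UNIT["GBps"]
    print(a == b)

    # Bitmap schema: 4 of 8 equally sized chunks.
    print(bitmap_amount(0xF0, scale=1, resolution=8))
```

Using exact fractions avoids deciding on a rounding rule here, which seems to be a separate question (see the "tolerance" discussion).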
>
>
> Note on the "tolerance" parameter:
>
> This is a new addition. On the MPAM side, the hardware has a choice
> about how to interpret the control value in some edge-case situations.
> We may not reasonably be able to probe for this, so it may be useful
> to warn software that there is an uncertainty margin.
>
> We might also be able to use the "tolerance" parameter to accommodate
> the rounding behaviour of the existing "MB" schema (otherwise, we
> might want a special "type" for this schema, if it doesn't comply
> closely enough).
>
>
> If we want to deploy resctrl under virtualisation, resctrl on the host
> could dynamically affect the actual amount of resource that is
> available for allocation inside a VM.
>
> Whether or not we ever want to do that, it might be useful to have a
> way to warn software that the effective control values hitting the
> hardware may not be entirely predictable.
>
> Thoughts?
>
> Cheers
> ---Dave
One thing I was pondering is that resctrl currently uses L3 interchangeably
as a scope and a resource, but if instead that is separated then it should be
easier to support interactions with resources at a different scope.
I am concerned that, for example, support for Global Memory Bandwidth Allocation
(GMBA) is planned to be done with a new resource. resctrl already has a
"memory bandwidth allocation" resource and introducing a new resource to essentially
manage the same resource, but at a different scope, sounds like a risk of fragmentation
and duplication to me.
What if the "resource control" instead gains a new property, for example, "scope" that
essentially communicates to user space what a domain ID in the schemata file means.
It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an
interface like the one below:
info
└── SMBA
    └── resource_schemata
        ├── SMBA
        │   ├── max
        │   ├── min
        │   ├── resolution
        │   ├── scale
        │   ├── scope <== contains "L3"
        │   ├── tolerance
        │   ├── type
        │   └── unit
        └── SMBA_NODE
            ├── max
            ├── min
            ├── resolution
            ├── scale
            ├── scope <== contains "NODE"
            ├── tolerance
            ├── type
            └── unit
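
To give an idea of how user space might consume such a layout, here is a minimal sketch that reads the per-schema property files into a dictionary. Everything about the paths is an assumption: the resource_schemata directory and the property file names (including "scope") exist only in this proposal, not in current resctrl, and the helper names are invented for illustration.

```python
from pathlib import Path

# Property file names as proposed in this thread (hypothetical interface).
PROPERTIES = ("type", "min", "max", "tolerance",
              "resolution", "scale", "scope", "unit")


def list_schemata(info_root, resource):
    """Names of all schemata advertised for one resource."""
    base = Path(info_root) / resource / "resource_schemata"
    return sorted(p.name for p in base.iterdir() if p.is_dir())


def read_schema_properties(info_root, resource, schema):
    """Read whichever property files are present for one schema."""
    schema_dir = Path(info_root) / resource / "resource_schemata" / schema
    props = {}
    for name in PROPERTIES:
        path = schema_dir / name
        if path.is_file():
            props[name] = path.read_text().strip()
    return props
```

A tool could then call, say, read_schema_properties("/sys/fs/resctrl/info", "SMBA", "SMBA_NODE") and look at the "scope" entry to learn what the domain IDs in the schemata file refer to. Missing files are skipped rather than treated as errors, so that properties can be added later.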
With an interface like the one above there is a single resource and allocating it at a different
scope is just another control. This correlates to how other parts of resctrl are managed.
For example, it can become explicit that the monitor groups' mon_data directory contains
sub-directories organized by scope. For example:
mon_data
├── mon_L3_00 <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01 <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_NODE_00 <== monitoring data at scope NODE
│   └── mbm_total_bytes
└── mon_NODE_01 <== monitoring data at scope NODE
    └── mbm_total_bytes
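
As a small sketch of what "organized by scope" could mean for a tool, the directory names above can be split into a scope and a domain ID. The mon_NODE_* names are hypothetical (they come from this example only), and the simple mon_<scope>_<id> pattern assumed here deliberately rejects anything else.

```python
import re

# Assumed naming pattern: mon_<scope>_<decimal domain id>.
MON_DIR_RE = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_dir(name):
    """Return (scope, domain_id) for a mon_data sub-directory name,
    or None if the name does not follow the assumed pattern."""
    match = MON_DIR_RE.match(name)
    if match is None:
        return None
    return match.group("scope"), int(match.group("domain"))


def group_by_scope(names):
    """Map each scope to the sorted list of domain IDs seen for it."""
    groups = {}
    for name in names:
        parsed = parse_mon_dir(name)
        if parsed is not None:
            scope, domain = parsed
            groups.setdefault(scope, []).append(domain)
    return {scope: sorted(ids) for scope, ids in groups.items()}


if __name__ == "__main__":
    dirs = ["mon_L3_00", "mon_L3_01", "mon_NODE_00", "mon_NODE_01"]
    print(group_by_scope(dirs))
```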
What do you think?
Reinette
> [1] Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
> https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
[2] https://lpc.events/event/19/contributions/2093/attachments/1958/4172/resctrl%20Microconference%20LPC%202025%20Tokyo.pdf
* Re: [RFC] fs/resctrl: Generic schema description
2025-12-16 22:26 ` Reinette Chatre
@ 2025-12-26 10:38 ` Chen, Yu C
2026-01-07 15:53 ` Dave Martin
2026-01-24 18:09 ` Drew Fustini
1 sibling, 1 reply; 11+ messages in thread
From: Chen, Yu C @ 2025-12-26 10:38 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, Fenghua Yu, Dave Martin
Cc: Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86, linux-kernel, fustini
Hi Reinette and all,
On 12/17/2025 6:26 AM, Reinette Chatre wrote:
> Hi Babu and Fenghua,
>
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
>
> On 10/24/25 4:12 AM, Dave Martin wrote:
[snip]
>
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with resource at a different scope.
>
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
>
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means.
>
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as example that I expect will build on SMBA that supports CXL.mem. Consider, an interface
> like below:
>
> info
> └── SMBA
> └── resource_schemata
> ├── SMBA
> │ ├── max
> │ ├── min
> │ ├── resolution
> │ ├── scale
> │ ├── scope <== contains "L3"
> │ ├── tolerance
> │ ├── type
> │ └── unit
> └── SMBA_NODE
> ├── max
> ├── min
> ├── resolution
> ├── scale
> ├── scope <== contains "NODE"
Would it be more user-friendly to explicitly show "node0, node1, ..."
rather than "NODE"? After all, we can already infer the "NODE" type from
the schemata name "SMBA_NODE".
> ├── tolerance
> ├── type
> └── unit
>
> With an interface like above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl is managed.
> For example, it can become explicit that the monitor groups' mon_data directory contains
> sub-directories organized by scope. For example:
>
> mon_data
> ├── mon_L3_00 <== monitoring data at scope L3
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_L3_01 <== monitoring data at scope L3
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_NODE_00 <== monitoring data at scope NODE
Does this mean the domain ID is "0", which corresponds to node0?
This seems to align with Fenghua's presentation at LPC,
where he mentioned that for CPU-less resctrl, the domain ID changes
from an L3 ID to a node ID.
> │ └── mbm_total_bytes
> └── mon_NODE_01 <== monitoring data at scope NODE
> └── mbm_total_bytes
>
Please let me take this chance to elaborate on region-aware RDT
in more detail. I am wondering if the interface could be further
extended to support this feature.
A "region" can be defined as a set of physical addresses that
belong to the same memory tier. The region ID is per socket
(i.e., unique within a single socket). Suppose we have a 2-socket
platform as follows:
S0: 1LM Direct DDR ==> NUMA node 0
CXL HDM (Tier2) ==> NUMA node 2
S1: 1LM Direct DDR ==> NUMA node 1
CXL HDM (Tier2) ==> NUMA node 3
region0 on S0 is node0, region1 on S0 is node2,
region0 on S1 is node1, region1 on S1 is node3.
Let us assume that each socket has 2 LLC domains.
For example, S0 has LLC domain0 and LLC domain1,
S1 has LLC domain2 and LLC domain3.
We propose the following schemata:
<resource name>_<region>_<control>
for example,
MB_REGION1_OPT:0=511;1=510;2=509;3=508
This means that, for LLC domain0 on S0, the throttle
level for node2 (because region1 on S0 is node2)
is 511. For LLC domain2 on S1, the throttle
level for node3 (because region1 on S1 is node3)
is 509.
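
For illustration, the interpretation above can be sketched in a few lines. The two lookup tables simply encode the example 2-socket platform described in this mail; they are not read from any real interface, the function names are invented, and the values are assumed to be decimal.

```python
def parse_schemata_line(line):
    """Split "NAME:dom=val;dom=val" into (name, {domain_id: value})."""
    name, _, body = line.partition(":")
    values = {}
    for entry in body.strip().split(";"):
        domain, _, value = entry.partition("=")
        values[int(domain)] = int(value)
    return name.strip(), values


# Example topology from this mail (hard-coded, illustration only).
DOMAIN_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}       # LLC domain -> socket
REGION_TO_NODE = {(0, 0): 0, (0, 1): 2,            # (socket, region) -> node
                  (1, 0): 1, (1, 1): 3}


def resolve(line, region):
    """Map each LLC domain to (target NUMA node, throttle level)."""
    _, values = parse_schemata_line(line)
    return {
        domain: (REGION_TO_NODE[(DOMAIN_TO_SOCKET[domain], region)], level)
        for domain, level in values.items()
    }


if __name__ == "__main__":
    line = "MB_REGION1_OPT:0=511;1=510;2=509;3=508"
    for domain, (node, level) in resolve(line, region=1).items():
        print(f"LLC domain{domain}: node{node} throttle level {level}")
```

The point of the sketch is that the domain ID in the schemata line is still an LLC domain, while the region index selects a different node depending on the socket.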
Users could query the exact definition of REGION1
by checking the info directory.
info
└── MB
    └── resource_schemata
        ├── MB_REGION1_OPT
        │   ├── max
        │   ├── min
        │   ├── resolution
        │   ├── scale
        │   ├── scope <== "0=node2;1=node3" (node2 on S0, node3 on S1)
        │   ├── tolerance
        │   ├── type
        │   └── unit
thanks,
Chenyu
* Re: [RFC] fs/resctrl: Generic schema description
2025-12-26 10:38 ` Chen, Yu C
@ 2026-01-07 15:53 ` Dave Martin
2026-01-09 16:09 ` Chen, Yu C
0 siblings, 1 reply; 11+ messages in thread
From: Dave Martin @ 2026-01-07 15:53 UTC (permalink / raw)
To: Chen, Yu C
Cc: Reinette Chatre, Babu Moger, Fenghua Yu, Tony Luck, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, linux-kernel, fustini
Hi,
On Fri, Dec 26, 2025 at 06:38:52PM +0800, Chen, Yu C wrote:
> Hi Reinette and all,
>
> On 12/17/2025 6:26 AM, Reinette Chatre wrote:
> > Hi Babu and Fenghua,
> >
> > Could you please consider how the new AMD and MPAM features [2] may benefit
> > from the new interfaces proposed here? More below ...
> >
> > On 10/24/25 4:12 AM, Dave Martin wrote:
>
> [snip]
>
> >
> > One thing I was pondering is that resctrl currently uses L3 interchangeably
> > as a scope and a resource but if instead that is separated then it should be
> > easier to support interactions with resource at a different scope.
> >
> > I am concerned that, for example, support for Global Memory Bandwidth Allocation
> > (GMBA) is planned to be done with a new resource. resctrl already has a
> > "memory bandwidth allocation" resource and introducing a new resource to essentially
> > manage the same resource, but at a different scope, sounds like a risk of fragmentation
> > and duplication to me.
> >
> > What if the "resource control" instead gains a new property, for example, "scope" that
> > essentially communicates to user space what a domain ID in the schemata file means.
> >
> > It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> > MBM as example that I expect will build on SMBA that supports CXL.mem. Consider, an interface
> > like below:
> >
> > info
> > └── SMBA
> > └── resource_schemata
> > ├── SMBA
> > │ ├── max
> > │ ├── min
> > │ ├── resolution
> > │ ├── scale
> > │ ├── scope <== contains "L3"
I guess we already have this confusion about domain IDs with monitoring
domains not necessarily being the same as control domains.
(The generic schema description does not try to address monitoring
domains, but the concept is still valid...)
"scope" seems a reasonable name.
What values would be expected here for the pre-existing schemata?
I'm thinking
"L2" for L2_foo schemata
"L3" for L3_foo
"L3" for MB (at least for the old MB schema)
Is it worth splitting out the level as a separate value? e.g.,
scope = "cache"
level = 3
Not all scopes will need a "level" parameter.
(This may not be sufficient for the region-aware case that Chenyu
outlines below.)
> > │ ├── tolerance
> > │ ├── type
> > │ └── unit
> > └── SMBA_NODE
> > ├── max
> > ├── min
> > ├── resolution
> > ├── scale
> > ├── scope <== contains "NODE"
>
> Would it be more user-friendly to explicitly show "node0, node1, ..."
> rather than "NODE"? After all, we can already infer the "NODE" type from
> the schemata name "SMBA_NODE".
I think that having an explicit declaration of the scope is probably
useful even for things that are included in the schema name.
Part of the reason for describing the schema explicitly is because
inferring everything from the name does not feel scalable as we add
more different schemata and resource types.
Having said that, the schema names should still provide a good clue as
to what the schema represents.
I'm not sure that we should simply list possible domain IDs here:
For MPAM, the domain IDs can be huge, random-looking numbers that do
not necessarily start from 0 (as currently implemented in the MPAM
driver).
In any case, we need not just names for the individual domain IDs, but
an idea of what they represent.
Maybe we could stick with opaque "scope" names as in Reinette's
proposal, and solve the problem of enumerating the domain IDs separately.
For the commonly-used scopes, we probably don't need to bother, since
the enumeration is available elsewhere:
* for NUMA nodes, /sys/devices/system/node/node*
(or /sys/devices/system/node/possible) ?
* for cache IDs at level <n>, the set of values present in all the
files /sys/devices/system/cpu/cpu*/cache/index<n>/id ?
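
A rough sketch of that enumeration from user space might look like the following. It relies only on the existing sysfs files named above plus the per-index "level" file; the sketch matches on "level" rather than assuming that index<n> corresponds to cache level <n>, since that is not guaranteed. The function names and the sysfs_root parameter are invented for illustration.

```python
from pathlib import Path


def parse_id_list(text):
    """Parse a kernel list string such as "0-3,8" into a sorted list."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue
        low, sep, high = part.partition("-")
        ids.update(range(int(low), int(high if sep else low) + 1))
    return sorted(ids)


def numa_node_ids(sysfs_root="/sys"):
    """Possible NUMA node IDs, from devices/system/node/possible."""
    path = Path(sysfs_root) / "devices/system/node/possible"
    return parse_id_list(path.read_text())


def cache_ids(level, sysfs_root="/sys"):
    """Distinct cache IDs seen at the given cache level across all CPUs."""
    ids = set()
    cpu_dir = Path(sysfs_root) / "devices/system/cpu"
    for index_dir in cpu_dir.glob("cpu[0-9]*/cache/index[0-9]*"):
        level_file = index_dir / "level"
        id_file = index_dir / "id"
        if not (level_file.is_file() and id_file.is_file()):
            continue
        if int(level_file.read_text()) == level:
            ids.add(int(id_file.read_text()))
    return sorted(ids)
```

This only covers the commonly-used scopes; it says nothing about how less conventional domain IDs (such as the large MPAM ones mentioned above) would be enumerated.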
> > ├── tolerance
> > ├── type
> > └── unit
> >
> > With an interface like above there is a single resource and allocating it at a different
> > scope is just another control. This correlates to how other parts of resctrl is managed.
> > For example, it can become explicit that the monitor groups' mon_data directory contains
> > sub-directories organized by scope. For example:
> >
> > mon_data
> > ├── mon_L3_00 <== monitoring data at scope L3
> > │ ├── llc_occupancy
> > │ ├── mbm_local_bytes
> > │ └── mbm_total_bytes
> > ├── mon_L3_01 <== monitoring data at scope L3
> > │ ├── llc_occupancy
> > │ ├── mbm_local_bytes
> > │ └── mbm_total_bytes
> > ├── mon_NODE_00 <== monitoring data at scope NODE
>
> Does this mean the domain ID is "0", which corresponds to node0?
> This seems to align with Fenghua's presentation at LPC,
> where he mentioned that for CPU-less resctrl, the domain ID changes
> from an L3 ID to a node ID.
In an ideal world, we would have a generic description for the monitors.
Coming up with a "scope" concept that works for monitoring domains
feels like something we should aim for, even if we don't yet describe
this explicitly for monitors.
Then, we could say that mon_L3_00 has
scope = "cache"
level = 3
domain = 0
(assuming that the monitoring domain really does align with the cache
control domain).
>
> > │ └── mbm_total_bytes
> > └── mon_NODE_01 <== monitoring data at scope NODE
> > └── mbm_total_bytes
> >
>
> Please let me take this chance to elaborate on region-aware RDT
> in more detail. I am wondering if the interface could be further
> extended to support this feature.
>
> A "region" can be defined as a set of physical addresses that
> belong to the same memory tier. The region ID is per socket
> (i.e., unique within a single socket). Suppose we have a 2-socket
> platform as follows:
>
>
> S0: 1LM Direct DDR ==> NUMA node 0
> CXL HDM (Tier2) ==> NUMA node 2
> S1: 1LM Direct DDR ==> NUMA node 1
> CXL HDM (Tier2) ==> NUMA node 3
>
> region0 on S0 is node0, region1 on S0 is node2,
> region0 on S1 is node1, region1 on S1 is node3.
>
> Let us assume that each socket has 2 LLC domains.
> For example, S0 has LLC domain0 and LLC domain1,
> S1 has LLC domain2 and LLC domain3.
>
> We propose the following schemata:
> <resource name>_<region>_<control>
> for example,
> MB_REGION1_OPT:0=511;1=510;2=509;3=508
> it means, for LLC domain0 on S0, the throttle
> level for node2(because region1 on S0 is node2)
> is 511. For LLC domain2 on S1, the throttle
> level for node3 (because region1 on S1 is node3)
> is 509.
>
> Users could query the exact definition of REGION1
> by checking the info directory.
>
> info
> └── MB
> └── resource_schemata
> ├── MB_REGION1_OPT
> │ ├── max
> │ ├── min
> │ ├── resolution
> │ ├── scale
> │ ├── scope <== "0=node2;1=node3" (node2 on S0, node3 on S1)
> │ ├── tolerance
> │ ├── type
> │ └── unit
>
>
> thanks,
> Chenyu
Hmmm, that's interesting.
If there is a grouping on NUMA nodes, is that advertised anywhere in
sysfs already?
Ideally, there would already be a definition of what "region 0" is in
terms of the NUMA topology, and we could just refer to it.
Cheers
---Dave
* Re: [RFC] fs/resctrl: Generic schema description
2026-01-07 15:53 ` Dave Martin
@ 2026-01-09 16:09 ` Chen, Yu C
0 siblings, 0 replies; 11+ messages in thread
From: Chen, Yu C @ 2026-01-09 16:09 UTC (permalink / raw)
To: Dave Martin
Cc: Reinette Chatre, Babu Moger, Fenghua Yu, Tony Luck, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, linux-kernel, fustini
Hi Dave,
On 1/7/2026 11:53 PM, Dave Martin wrote:
> Hi,
>
> On Fri, Dec 26, 2025 at 06:38:52PM +0800, Chen, Yu C wrote:
[snip]
>
>>> │ ├── tolerance
>>> │ ├── type
>>> │ └── unit
>>> └── SMBA_NODE
>>> ├── max
>>> ├── min
>>> ├── resolution
>>> ├── scale
>>> ├── scope <== contains "NODE"
>>
>> Would it be more user-friendly to explicitly show "node0, node1, ..."
>> rather than "NODE"? After all, we can already infer the "NODE" type from
>> the schemata name "SMBA_NODE".
>
> I think that having an explicit declaration of the scope is probably
> useful even for things that are included in the schema name.
>
> Part of the reason for describing the schema explicitly is because
> inferring everything from the name does not feel scalable as we add
> more different schemata and resource types.
>
OK, this makes sense.
> Having said that, the schema names should still provide a good clue as
> to what the schema represents.
>
>
> I'm not sure that we should simply list possible domain IDs here:
>
> For MPAM, the domain IDs can be huge, random-looking numbers that do
> not necessarily start from 0 (as currently implemented in the MPAM
> driver).
>
> In any case, we need not just names for the individual domain IDs, but
> an idea of what they represent.
>
>
> Maybe we could stick with opaque "scope" names as in Reinette's
> proposal, and solve the problem of enumerating the domain IDs separately.
>
>
> For the commonly-used scopes, we probably don't need to bother, since
> the enumeration is available elsewhere:
>
> * for NUMA nodes and cache IDs, /sys/devices/system/node/node*
> (or /sys/devices/system/node/possible) ?
>
> * for cache IDs at level <n>, the set of values present in all the
> files /sys/devices/system/cpu/cpu*/cache/index<n>/id ?
>
Previously, the node list display was proposed mainly to build a
connection between regions and nodes. If we know the node ID, we can
check the detailed information of that node via sysfs (such as
/sys/devices/system/node/node*).
But I agree that keeping the "scope" simple and displaying the
connection somewhere else is reasonable.
>>
>>> │ └── mbm_total_bytes
>>> └── mon_NODE_01 <== monitoring data at scope NODE
>>> └── mbm_total_bytes
>>>
>>
>> Please let me take this chance to elaborate on region-aware RDT
>> in more detail. I am wondering if the interface could be further
>> extended to support this feature.
>>
[snip]
>> Users could query the exact definition of REGION1
>> by checking the info directory.
>>
>> info
>> └── MB
>> └── resource_schemata
>> ├── MB_REGION1_OPT
>> │ ├── max
>> │ ├── min
>> │ ├── resolution
>> │ ├── scale
>> │ ├── scope <== "0=node2;1=node3" (node2 on S0, node3 on S1)
>> │ ├── tolerance
>> │ ├── type
>> │ └── unit
>>
>>
>
> Hmmm, that's interesting.
>
> If there is a grouping on NUMA nodes, is that advertised anywhere in
> sysfs already?
>
> Ideally, there would already be a definition of what "region 0" is in
> terms of the NUMA topology, and we could just refer to it.
>
We have a sysfs interface exposed in /sys/firmware/acpi/memory_ranges/;
each entry represents a physical address range with local or remote region
IDs. I think we can build based on this interface.
thanks,
Chenyu
* Re: [RFC] fs/resctrl: Generic schema description
2025-12-16 22:26 ` Reinette Chatre
2025-12-26 10:38 ` Chen, Yu C
@ 2026-01-24 18:09 ` Drew Fustini
1 sibling, 0 replies; 11+ messages in thread
From: Drew Fustini @ 2026-01-24 18:09 UTC (permalink / raw)
To: Reinette Chatre
Cc: Dave Martin, linux-kernel, Babu Moger, Fenghua Yu, Tony Luck,
James Morse, Chen, Yu C, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86
On Tue, Dec 16, 2025 at 02:26:23PM -0800, Reinette Chatre wrote:
> Hi Babu and Fenghua,
>
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
>
> On 10/24/25 4:12 AM, Dave Martin wrote:
> > Hi all,
> >
> > Going forward, a single resctrl resource (such as memory bandwidth) is
> > likely to require multiple schemata, either because we want to add new
> > schemata that provide finer control, or because the hardware has
> > multiple controls, covering different aspects of resource allocation.
> >
> > The fit between MPAM's memory bandwidth controls and the resctrl MB
> > schema is already awkward, and later Intel RDT features such as Region
> > Aware Memory Bandwidth Allocation are already pushing past what the MB
> > schema can describe. Both of these can involve multiple control
> > values and finer resolution than the 100 steps offered by the current
> > "MB" schema.
> >
> > The previous discussion went off in a few different directions [1], so
> > I want to focus back onto defining an extended schema description that
> > aims to cover the use cases that we know about or anticipate today, and
> > allows for future extension as needed.
> >
> > (A separate discussion is needed on how new schemata interact with
> > previously-defined schemata (such as the MB percentage schema).
> > I suggest we pause that discussion for now, in the interests of getting
> > the schema description nailed down.)
> >
> >
> > Following on from the previous mail thread, I've tried to refine and
> > flesh out the proposal for schema descriptions a bit, as follows.
> >
> > Proposal:
> >
> > * Split resource names and schema names in resctrlfs.
> >
> > Resources will be named for the unique, existing schema for each
> > resource.
> >
> > The existing schema will keep its name (the same as the resource
> > name), and new schemata defined for a resource will include that
> > name as a prefix (at least, by default).
> >
> > So, for example, we will have an MB resource with a schema called
> > MB (the schema that we have already). But we may go on to define
> > additional schemata for the MB resource, with names such as MB_MAX,
> > etc.
> >
> > * Stop adding new schema description information in the top-level
> > info/<resource>/ directory in resctrlfs.
> >
> > For backwards compatibility, we can keep the existing property
> > files under the resource info directory to describe the previously
> > defined resource, but we seem to need something richer going
> > forward.
> >
> > * Add a hierarchy to list all the schemata for each resource, along
> > with their properties. So far, the proposal looks like this,
> > taking the MB resource as an example:
> >
> > info/
> >  └─ MB/
> >      └─ resource_schemata/
> >          ├─ MB/
> >          ├─ MB_MIN/
> >          ├─ MB_MAX/
> >          ┆
> >
> > Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> > In this proposal, these are just dummy schema names for
> > illustration purposes. The important thing is that they all
> > control aspects of the "MB" resource, and that there can be more
> > than one of them.
> >
> > It may be appropriate to have a nested hierarchy, where some
> > schemata are presented as children of other schemata if they
> > affect the same hardware controls. For now, let's put this issue
> > on one side, and consider what properties should be advertised for
> > each schema.
> >
> > * Current properties that I think we might want are:
> >
> > info/
> >  └─ SOME_RESOURCE/
> >      └─ resource_schemata/
> >          ├─ SOME_SCHEMA/
> >          ┆   ├─ type
> >              ├─ min
> >              ├─ max
> >              ├─ tolerance
> >              ├─ resolution
> >              ├─ scale
> >              └─ unit
> >
> > (I've tweaked the properties a bit since previous postings.
> > "type" replaces "map"; "scale" is now the unit multiplier;
> > "resolution" is now a scaling divisor -- details below.)
> >
> > I assume that we expose the properties in individual files, but we
> > could also combine them into a single description file per schema,
> > per resource or (possibly) a single global file.
> > (I don't have a strong view on the best option.)
> >
> >
> > Either way, the following set of properties may be a reasonable
> > place to start:
> >
> >
> > type: the schema type, followed by optional flag specifiers:
> >
> > - "scalar": a single-valued numeric control
> >
> > A mandatory flag indicates how the control value written to
> > the schemata file is converted to an amount of resource for
> > hardware regulation.
> >
> > The flag "linear" indicates a linear mapping.
> >
> > In this case, the amount of resource E that is actually
> > allocated is derived from the control value C written to the
> > schemata file as follows:
> >
> > E = C * scale * unit / resolution
> >
> > Other flag values could be defined later, if we encounter
> > hardware with non-linear controls.
> >
> > - "bitmap": a bitmap control
> >
> > The optional flag "sparse" is present if the control accepts
> > sparse bitmaps.
> >
> > In this case, E = bitmap_weight(C) * scale * unit / resolution.
> >
> > As before, each bit controls access to a specific chunk of
> > resource in the hardware, such as a group of cache lines. All
> > chunks are equally sized.
> >
> > (Different CTRL_MON groups may still contend within the
> > allocation E, when they have bits in common between their
> > bitmaps.)
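(Illustration only, not part of the quoted proposal: a minimal Python
sketch of the two formulas above, as a userspace tool might evaluate
them. The function names are hypothetical, and "unit" is factored out,
so each result is E expressed in multiples of the base unit.)

```python
from fractions import Fraction


def scalar_linear_amount(c: int, scale: int, resolution: int) -> Fraction:
    """E / unit for a "scalar linear" schema: C * scale / resolution."""
    return Fraction(c * scale, resolution)


def bitmap_amount(c: int, scale: int, resolution: int) -> Fraction:
    """E / unit for a "bitmap" schema: bitmap_weight(C) * scale / resolution."""
    weight = bin(c).count("1")  # number of set bits, like bitmap_weight()
    return Fraction(weight * scale, resolution)
```

For example, a control value of 50 with scale=1 and resolution=100
gives 1/2 of a proportional resource, and so does the bitmap 0xf0 on
an 8-bit bitmap schema.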
> >
> > min:
> >
> > - For a scalar schema, the minimum value that can be written to
> > the control when writing the schemata file.
> >
> > - For a bitmap schema, a bitmap of the minimum weight that the
> > schema accepts: if an empty bitmap is accepted, this can be 0.
> > Otherwise, if bitmaps with a single bit set are acceptable,
> > this can just have the lowest-order bit set.
> >
> > Most commonly, the value will probably be "1".
> >
> > For bitmap schemata, we might report this in hex. In the
> > interest of generic parsing, we could include a "0x" prefix if
> > so.
> >
> > max:
> >
> > - For a scalar schema, the maximum value that can be written to
> > the control when writing the schemata file.
> >
> > - For a bitmap schema, the mask with all bits set.
> >
> > Possibly reported in hex for bitmap schemata (as for "min").
> >
> > tolerance:
> >
> > (See below for discussion on this.)
> >
> > - "0": the control is exact
> >
> > - "1": the effective control value is within ±1 of the control
> > value written to the schemata file. (Similarly, positive "n" ->
> > ±n.)
> >
> > A negative value could be used to indicate that the tolerance
> > is unknown. (Possibly we could also just omit the property,
> > though it seems better to warn userspace explicitly if we
> > don't know.)
> >
> > Tests might make use of this parameter in order to determine
> > how picky to be about exact measurement results.
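(A test could consume "tolerance" roughly as sketched below. This is a
hypothetical helper, assuming the semantics listed above: 0 is exact,
positive n means within ±n, and a negative value means unknown. How a
test should react to an unknown tolerance is left to the caller.)

```python
from typing import Optional


def effective_value_ok(written: int, effective: int,
                       tolerance: int) -> Optional[bool]:
    """Check an observed effective control value against "tolerance".

    Returns None when the tolerance is unknown (negative), so that the
    caller can decide whether to skip or relax the check.
    """
    if tolerance < 0:
        return None
    return abs(effective - written) <= tolerance
```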
> >
> > resolution:
> >
> > - For a proportional scalar schema: the number of divisions that
> > the whole resource is divided into. (See below for
> > "proportional scalar schema".)
> >
> > Typically, this will be the same as the "max" value.
> >
> > - For an absolute scalar schema: the divisor applied to the
> > control value.
> >
> > - For a bitmap schema: the size of the bitmap in bits.
> >
> > scale:
> >
> > - For a scalar schema: the scale-up multiplier applied to
> > "unit".
> >
> > - For a bitmap schema: probably "1".
> >
> > unit:
> >
> > - The base unit of the quantity measured by the control value.
> >
> > The special unit "all" denotes a proportional schema. In this
> > case, the resource is a finite, physical thing such as a cache
> > or maxed-out data throughput of a memory controller. The
> > entire physical resource is available for allocation, and the
> > control value indicates what proportion of it is allocated.
> >
> > Bitmap schemata will probably all be proportional and use the
> > unit "all". (This applies to cache bitmaps, at least.)
> >
> > Absolute schemata will require specification of the base unit
> > here, say, "MBps". The "scale" parameter can be used to avoid
> > proliferation of unit strings:
> >
> > For example, {scale=1000, unit="MBps"} would be equivalent to
> > {scale=1, unit="GBps"}.
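(To make the scale/unit equivalence concrete, here is a small Python
sketch. The unit table is hypothetical and only encodes the 1000:1
ratio between "GBps" and "MBps" implied by the example above; the
proposal does not define a list of unit strings.)

```python
from fractions import Fraction

# Hypothetical table: size of each unit string expressed in MBps.
UNIT_IN_MBPS = {"MBps": 1, "GBps": 1000}


def per_step_mbps(scale: int, unit: str, resolution: int = 1) -> Fraction:
    """Bandwidth represented by one step of the control value, in MBps."""
    return Fraction(scale * UNIT_IN_MBPS[unit], resolution)
```

With this, per_step_mbps(1000, "MBps") and per_step_mbps(1, "GBps")
compare equal, which is why a single unit string plus "scale" is
enough.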
> >
> >
> > Note on the "tolerance" parameter:
> >
> > This is a new addition. On the MPAM side, the hardware has a choice
> > about how to interpret the control value in some edge-case situations.
> > We may not reasonably be able to probe for this, so it may be useful
> > to warn software that there is an uncertainty margin.
> >
> > We might also be able to use the "tolerance" parameter to accommodate
> > the rounding behaviour of the existing "MB" schema (otherwise, we
> > might want a special "type" for this schema, if it doesn't comply
> > closely enough).
> >
> >
> > If we want to deploy resctrl under virtualisation, resctrl on the host
> > could dynamically affect the actual amount of resource that is
> > available for allocation inside a VM.
> >
> > Whether or not we ever want to do that, it might be useful to have a
> > way to warn software that the effective control values hitting the
> > hardware may not be entirely predictable.
> >
> > Thoughts?
> >
> > Cheers
> > ---Dave
>
>
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with resources at a different scope.
>
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
>
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means?
>
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an interface
> like the one below:
>
> info
> └── SMBA
>     └── resource_schemata
>         ├── SMBA
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope <== contains "L3"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>         └── SMBA_NODE
>             ├── max
>             ├── min
>             ├── resolution
>             ├── scale
>             ├── scope <== contains "NODE"
>             ├── tolerance
>             ├── type
>             └── unit
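(To show how userspace might consume a layout like the one above, here
is a minimal Python sketch. It assumes the one-file-per-property
variant of the proposal; none of these paths exist in a current
kernel, and read_schema_properties() is a hypothetical helper name.)

```python
from pathlib import Path


def read_schema_properties(info_dir: Path, resource: str) -> dict:
    """Return {schema_name: {property_name: value}} for one resource.

    Walks <info_dir>/<resource>/resource_schemata/ and reads every
    regular file in each schema directory as one property.
    """
    schemata = {}
    base = Path(info_dir) / resource / "resource_schemata"
    for schema_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        schemata[schema_dir.name] = {
            f.name: f.read_text().strip()
            for f in sorted(schema_dir.iterdir())
            if f.is_file()
        }
    return schemata
```

A tool could then pick the control to use by checking
props["SMBA_NODE"]["scope"] rather than by hard-coding what a domain
ID means for each schema.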
>
> With an interface like above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl are managed.
> For example, it can become explicit that the monitor groups' mon_data directory contains
> sub-directories organized by scope. For example:
>
> mon_data
> ├── mon_L3_00 <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01 <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00 <== monitoring data at scope NODE
> │   └── mbm_total_bytes
> └── mon_NODE_01 <== monitoring data at scope NODE
>     └── mbm_total_bytes
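(If mon_data directory names encode the scope as in the example above,
userspace could split them as sketched below. This is only a sketch:
the mon_<SCOPE>_<ID> pattern is inferred from the names shown here,
the "NODE" scope is part of the suggestion rather than an existing
interface, and parse_mon_dir() is a hypothetical helper.)

```python
import re
from typing import Optional, Tuple

# mon_<SCOPE>_<ID>, e.g. "mon_L3_00" or "mon_NODE_01".
_MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_dir(name: str) -> Optional[Tuple[str, int]]:
    """Split a mon_data sub-directory name into (scope, domain ID).

    Returns None for names that do not follow the assumed pattern.
    """
    m = _MON_DIR.match(name)
    if m is None:
        return None
    return m.group("scope"), int(m.group("domain"))
```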
>
> What do you think?

I think that the ability to have different scopes for a resource would
work well for QoS on RISC-V. The CBQRI spec [1] defines operations for
bandwidth controllers, which can be located anywhere in the system.
I've been having trouble deciding what to do about a CBQRI-enabled
memory controller, as all bandwidth monitoring is currently assumed to
be at L3 scope. Therefore, my RFC series [2] that adds resctrl support
for RISC-V does not support bandwidth monitoring, but I think the scope
concept could make it work.

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0
[2] https://lore.kernel.org/all/20260119-ssqosid-cbqri-v1-0-aa2a75153832@kernel.org/
end of thread, other threads:[~2026-01-24 18:09 UTC | newest]
Thread overview: 11+ messages
2025-10-24 11:12 [RFC] fs/resctrl: Generic schema description Dave Martin
2025-10-28 23:17 ` Reinette Chatre
2025-10-30 16:36 ` Dave Martin
2025-11-04 22:26 ` Reinette Chatre
2025-11-06 17:45 ` Reinette Chatre
2025-11-10 12:37 ` Ben Horgan
2025-12-16 22:26 ` Reinette Chatre
2025-12-26 10:38 ` Chen, Yu C
2026-01-07 15:53 ` Dave Martin
2026-01-09 16:09 ` Chen, Yu C
2026-01-24 18:09 ` Drew Fustini