* [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
@ 2025-09-02 16:24 Dave Martin
2025-09-12 22:19 ` Reinette Chatre
2025-09-22 15:04 ` Dave Martin
0 siblings, 2 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-02 16:24 UTC (permalink / raw)
To: linux-kernel
Cc: Tony Luck, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
The control value parser for the MB resource currently coerces the
memory bandwidth percentage value from userspace to be an exact
multiple of the bw_gran parameter.
On MPAM systems, this results in somewhat worse-than-worst-case
rounding, since bw_gran is in general only an approximation to the
actual hardware granularity on these systems, and the hardware
bandwidth allocation control value is not natively a percentage --
necessitating a further conversion in the resctrl_arch_update_domains()
path, regardless of the conversion done at parse time.
Allow the arch to provide its own parse-time conversion that is
appropriate for the hardware, and move the existing conversion to x86.
This will avoid accumulated error from rounding the value twice on MPAM
systems.
Clarify the documentation, but avoid overly exact promises.
Clamping to bw_min and bw_max still feels generic: leave it in the core
code, for now.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
---
Based on v6.17-rc3.
Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
the other tests except for the NONCONT_CAT tests, which do not seem to
be supported in my configuration -- and have nothing to do with the
code touched by this patch).
Notes:
I put the x86 version out of line in order to avoid having to move
struct rdt_resource and its dependencies into resctrl_types.h -- which
would create a lot of diff noise. Schemata writes from userspace have
a high overhead in any case.
For MPAM the conversion will be a no-op, because the incoming
percentage from the core resctrl code needs to be converted to hardware
representation in the driver anyway.
Perhaps _all_ the types should move to resctrl_types.h.
For now, I went for the smallest diffstat...
---
Documentation/filesystems/resctrl.rst | 7 +++----
arch/x86/include/asm/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++
fs/resctrl/ctrlmondata.c | 2 +-
include/linux/resctrl.h | 6 ++++++
5 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..a1d0469d6dfb 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -143,12 +143,11 @@ with respect to allocation:
user can request.
"bandwidth_gran":
- The granularity in which the memory bandwidth
+ The approximate granularity in which the memory bandwidth
percentage is allocated. The allocated
b/w percentage is rounded off to the next
- control step available on the hardware. The
- available bandwidth control steps are:
- min_bandwidth + N * bandwidth_gran.
+ control step available on the hardware. The available
+ steps are at least as small as this value.
"delay_linear":
Indicates if the delay scale is linear or
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index feb93b50e990..8bec2b9cc503 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -18,6 +18,8 @@
*/
#define X86_RESCTRL_EMPTY_CLOSID ((u32)~0)
+struct rdt_resource;
+
/**
* struct resctrl_pqr_state - State cache for the PQR MSR
* @cur_rmid: The cached Resource Monitoring ID
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 1189c0df4ad7..cf9b30b5df3c 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -16,9 +16,15 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/cpu.h>
+#include <linux/math.h>
#include "internal.h"
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
+{
+ return roundup(val, (unsigned long)r->membw.bw_gran);
+}
+
int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type t, u32 cfg_val)
{
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index d98e0d2de09f..c5e73b75aaa0 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
return false;
}
- *data = roundup(bw, (unsigned long)r->membw.bw_gran);
+ *data = resctrl_arch_round_bw(bw, r);
return true;
}
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 6fb4894b8cfd..5b2a555cf2dd 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
+/*
+ * Round a bandwidth control value to the nearest value acceptable to
+ * the arch code for resource r:
+ */
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
+
/*
* Update the ctrl_val and apply this config right now.
* Must be called on one of the domain's CPUs.
base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
--
2.34.1
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-02 16:24 [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch Dave Martin
@ 2025-09-12 22:19 ` Reinette Chatre
2025-09-22 14:39 ` Dave Martin
2025-09-22 15:04 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-09-12 22:19 UTC (permalink / raw)
To: Dave Martin, linux-kernel
Cc: Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86, linux-doc
Hi Dave,
nits:
Please use the subject prefix "x86,fs/resctrl" to be consistent with other
resctrl code (and was established by Arm :)).
Also please use upper case for acronym mba->MBA.
On 9/2/25 9:24 AM, Dave Martin wrote:
> The control value parser for the MB resource currently coerces the
> memory bandwidth percentage value from userspace to be an exact
> multiple of the bw_gran parameter.
(to help be specific)
"the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?
>
> On MPAM systems, this results in somewhat worse-than-worst-case
> rounding, since bw_gran is in general only an approximation to the
> actual hardware granularity on these systems, and the hardware
> bandwidth allocation control value is not natively a percentage --
> necessitating a further conversion in the resctrl_arch_update_domains()
> path, regardless of the conversion done at parse time.
>
> Allow the arch to provide its own parse-time conversion that is
> appropriate for the hardware, and move the existing conversion to x86.
> This will avoid accumulated error from rounding the value twice on MPAM
> systems.
>
> Clarify the documentation, but avoid overly exact promises.
>
> Clamping to bw_min and bw_max still feels generic: leave it in the core
> code, for now.
Sounds like MPAM may be ready to start the schema parsing discussion again?
I understand that MPAM has a few more ways to describe memory bandwidth as
well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
schema format to user space, which seems like a good idea for new schema.
Is this something MPAM is still considering? For example, the minimum
and maximum ranges that can be specified, is this something you already
have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
discussion on how to handle min/max ranges for bandwidth?
>
> No functional change.
>
> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
>
> ---
>
> Based on v6.17-rc3.
>
> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> the other tests except for the NONCONT_CAT tests, which do not seem to
> be supported in my configuration -- and have nothing to do with the
> code touched by this patch).
Is the NONCONT_CAT test failing (i.e printing "not ok")?
The NONCONT_CAT tests may print error messages as debug information as part of
running, but these errors are expected as part of the test. The test should accurately
state whether it passed or failed though. For example, below attempts to write
a non-contiguous CBM to a system that does not support non-contiguous masks.
This fails as expected, error messages printed as debugging and thus the test passes
with an "ok".
# Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
# Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
ok 5 L3_NONCONT_CAT: test
>
> Notes:
>
> I put the x86 version out of line in order to avoid having to move
> struct rdt_resource and its dependencies into resctrl_types.h -- which
> would create a lot of diff noise. Schemata writes from userspace have
> a high overhead in any case.
Sounds good, I expect compiler will inline.
>
> For MPAM the conversion will be a no-op, because the incoming
> percentage from the core resctrl code needs to be converted to hardware
> representation in the driver anyway.
(addressed below)
>
> Perhaps _all_ the types should move to resctrl_types.h.
Can surely consider when there is a good motivation.
>
> For now, I went for the smallest diffstat...
>
> ---
> Documentation/filesystems/resctrl.rst | 7 +++----
> arch/x86/include/asm/resctrl.h | 2 ++
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++
> fs/resctrl/ctrlmondata.c | 2 +-
> include/linux/resctrl.h | 6 ++++++
> 5 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index c7949dd44f2f..a1d0469d6dfb 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -143,12 +143,11 @@ with respect to allocation:
> user can request.
>
> "bandwidth_gran":
> - The granularity in which the memory bandwidth
> + The approximate granularity in which the memory bandwidth
> percentage is allocated. The allocated
> b/w percentage is rounded off to the next
> - control step available on the hardware. The
> - available bandwidth control steps are:
> - min_bandwidth + N * bandwidth_gran.
> + control step available on the hardware. The available
> + steps are at least as small as this value.
A bit difficult to parse for me.
Is "at least as small as" same as "at least"?
Please note that the documentation has a section "Memory bandwidth Allocation
and monitoring" that also contains these exact promises.
>
> "delay_linear":
> Indicates if the delay scale is linear or
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index feb93b50e990..8bec2b9cc503 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -18,6 +18,8 @@
> */
> #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0)
>
> +struct rdt_resource;
> +
I'm missing something here. Why is this needed?
> /**
> * struct resctrl_pqr_state - State cache for the PQR MSR
> * @cur_rmid: The cached Resource Monitoring ID
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1189c0df4ad7..cf9b30b5df3c 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -16,9 +16,15 @@
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> #include <linux/cpu.h>
> +#include <linux/math.h>
>
> #include "internal.h"
>
> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
> +{
> + return roundup(val, (unsigned long)r->membw.bw_gran);
> +}
> +
> int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> u32 closid, enum resctrl_conf_type t, u32 cfg_val)
> {
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index d98e0d2de09f..c5e73b75aaa0 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> return false;
> }
>
> - *data = roundup(bw, (unsigned long)r->membw.bw_gran);
> + *data = resctrl_arch_round_bw(bw, r);
Please check that function comments remain accurate after changes (specifically
if making the conversion more generic as proposed below).
> return true;
> }
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 6fb4894b8cfd..5b2a555cf2dd 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>
> +/*
> + * Round a bandwidth control value to the nearest value acceptable to
> + * the arch code for resource r:
> + */
> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> +
I do not think that resctrl should make any assumptions on what the
architecture's conversion does (i.e "round"). That architecture needs to be
asked to "round a bandwidth control value" also sounds strange since resctrl really
should be able to do something like rounding itself. As I understand from
the notes this will be a no-op for MPAM making this even more confusing.
How about naming the helper something like resctrl_arch_convert_bw()?
(Open to other ideas of course).
If you make such a change, please check that subject of patch still fits.
I think that using const to pass data to architecture is great, thanks.
Reinette
[1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/
[2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/
[3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-12 22:19 ` Reinette Chatre
@ 2025-09-22 14:39 ` Dave Martin
2025-09-23 17:27 ` Reinette Chatre
2025-10-15 15:18 ` Dave Martin
0 siblings, 2 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-22 14:39 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
Thanks for the review.
On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> nits:
> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> resctrl code (and was established by Arm :)).
> Also please use upper case for acronym mba->MBA.
Ack (the local custom in the MPAM code is to use "mba", but arguably,
the meaning is not quite the same -- I'll change it.)
> On 9/2/25 9:24 AM, Dave Martin wrote:
> > The control value parser for the MB resource currently coerces the
> > memory bandwidth percentage value from userspace to be an exact
> > multiple of the bw_gran parameter.
>
> (to help be specific)
> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?
"bw_gran" was intended as an informal shorthand for the abstract
parameter (exposed both in the field you mention and through the
bandwidth_gran file in resctrl).
I can rewrite it as per your suggestion, but this could be read as
excluding the bandwidth_gran file. Would it make sense just to write
it out longhand? For now, I've rewritten it as follows:
| The control value parser for the MB resource currently coerces the
| memory bandwidth percentage value from userspace to be an exact
| multiple of the bandwidth granularity parameter.
|
| On MPAM systems, this results in somewhat worse-than-worst-case
| rounding, since the bandwidth granularity advertised to resctrl by the
| MPAM driver is in general only an approximation [...]
(I'm happy to go with your suggestion if you're not keen on this,
though.)
> > On MPAM systems, this results in somewhat worse-than-worst-case
> > rounding, since bw_gran is in general only an approximation to the
> > actual hardware granularity on these systems, and the hardware
> > bandwidth allocation control value is not natively a percentage --
> > necessitating a further conversion in the resctrl_arch_update_domains()
> > path, regardless of the conversion done at parse time.
> >
> > Allow the arch to provide its own parse-time conversion that is
> > appropriate for the hardware, and move the existing conversion to x86.
> > This will avoid accumulated error from rounding the value twice on MPAM
> > systems.
> >
> > Clarify the documentation, but avoid overly exact promises.
> >
> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
>
> Sounds like MPAM may be ready to start the schema parsing discussion again?
> I understand that MPAM has a few more ways to describe memory bandwidth as
> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> schema format to user space, which seems like a good idea for new schema.
My own ideas in this area are a little different, though I agree with
the general idea.
I'll respond separately on that, to avoid this thread getting off-topic.
For this patch, my aim was to avoid changing anything unnecessarily.
> Is this something MPAM is still considering? For example, the minimum
> and maximum ranges that can be specified, is this something you already
> have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
> discussion on how to handle min/max ranges for bandwidth?
This is another thing that we probably do want to support at some point,
but it feels like a different thing from the minimum and maximum bounds
acceptable to an individual schema -- especially since in the hardware
they may behave more like trigger points than hard limits.
Again, I'll respond separately.
[...]
> > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> > the other tests except for the NONCONT_CAT tests, which do not seem to
> > be supported in my configuration -- and have nothing to do with the
> > code touched by this patch).
>
> Is the NONCONT_CAT test failing (i.e printing "not ok")?
>
> The NONCONT_CAT tests may print error messages as debug information as part of
> running, but these errors are expected as part of the test. The test should accurately
> state whether it passed or failed though. For example, below attempts to write
> a non-contiguous CBM to a system that does not support non-contiguous masks.
> This fails as expected, error messages printed as debugging and thus the test passes
> with an "ok".
>
> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
> ok 5 L3_NONCONT_CAT: test
I don't think that this was anything to do with my changes, but I no
longer seem to have the test output. (Since this test has to do with
bitmap schemata (?), it seemed unlikely to be affected by changes to
bw_validate().)
I'll need to re-test with and without this patch to check whether it
makes any difference.
> > Notes:
> >
> > I put the x86 version out of line in order to avoid having to move
> > struct rdt_resource and its dependencies into resctrl_types.h -- which
> > would create a lot of diff noise. Schemata writes from userspace have
> > a high overhead in any case.
>
> Sounds good, I expect compiler will inline.
The function and caller are in separate translation units, so unless
LTO is used, I don't think the function will be inlined.
> >
> > For MPAM the conversion will be a no-op, because the incoming
> > percentage from the core resctrl code needs to be converted to hardware
> > representation in the driver anyway.
>
> (addressed below)
>
> >
> > Perhaps _all_ the types should move to resctrl_types.h.
>
> Can surely consider when there is a good motivation.
>
> >
> > For now, I went for the smallest diffstat...
I'll assume the motivation is not strong enough for now, but shout if
you disagree.
[...]
> > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> > index c7949dd44f2f..a1d0469d6dfb 100644
> > --- a/Documentation/filesystems/resctrl.rst
> > +++ b/Documentation/filesystems/resctrl.rst
> > @@ -143,12 +143,11 @@ with respect to allocation:
> > user can request.
> >
> > "bandwidth_gran":
> > - The granularity in which the memory bandwidth
> > + The approximate granularity in which the memory bandwidth
> > percentage is allocated. The allocated
> > b/w percentage is rounded off to the next
> > - control step available on the hardware. The
> > - available bandwidth control steps are:
> > - min_bandwidth + N * bandwidth_gran.
> > + control step available on the hardware. The available
> > + steps are at least as small as this value.
>
> A bit difficult to parse for me.
> Is "at least as small as" same as "at least"?
It was supposed to mean: "The available steps are no larger than this
value."
Formally, my expectation is that this value is the smallest integer
number of percent which is not smaller than the apparent size of any
individual rounding step. Equivalently, this is the smallest number g
for which writing "MB: 0=x" and "MB: 0=y" yield different
configurations for every in-range x, where y = x + g is also in-range.
That's a bit of a mouthful, though. If you can think of a more
succinct way of putting it, I'm open to suggestions!
> Please note that the documentation has a section "Memory bandwidth Allocation
> and monitoring" that also contains these exact promises.
Hmmm, somehow I completely missed that.
Does the following make sense? Ideally, there would be a simpler way
to describe the discrepancy between the reported and actual values of
bw_gran...
| Memory bandwidth Allocation and monitoring
| ==========================================
|
| [...]
|
| The minimum bandwidth percentage value for each cpu model is predefined
| and can be looked up through "info/MB/min_bandwidth". The bandwidth
| granularity that is allocated is also dependent on the cpu model and can
| be looked up at "info/MB/bandwidth_gran". The available bandwidth
| -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
| -to the next control step available on the hardware.
| +control steps are: min_bw + N * (bw_gran - e), where e is a
| +non-negative, hardware-defined real constant that is less than 1.
| +Intermediate values are rounded to the next control step available on
| +the hardware.
| +
| +At the time of writing, the constant e referred to in the preceding
| +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
| +describes the step size exactly), but this may not be the case on other
| +hardware when the actual granularity is not an exact divisor of 100.
>
> >
> > "delay_linear":
> > Indicates if the delay scale is linear or
> > diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> > index feb93b50e990..8bec2b9cc503 100644
> > --- a/arch/x86/include/asm/resctrl.h
> > +++ b/arch/x86/include/asm/resctrl.h
> > @@ -18,6 +18,8 @@
> > */
> > #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0)
> >
> > +struct rdt_resource;
> > +
>
> I'm missing something here. Why is this needed?
Oops, it's not. This got left behind from when I had the function
in-line here.
Removed.
[...]
> > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > index d98e0d2de09f..c5e73b75aaa0 100644
> > --- a/fs/resctrl/ctrlmondata.c
> > +++ b/fs/resctrl/ctrlmondata.c
> > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> > return false;
> > }
> >
> > - *data = roundup(bw, (unsigned long)r->membw.bw_gran);
> > + *data = resctrl_arch_round_bw(bw, r);
>
> Please check that function comments remain accurate after changes (specifically
> if making the conversion more generic as proposed below).
I hoped that the comment for this function was still applicable, though
it can probably be improved. How about the following?
| - * hardware. The allocated bandwidth percentage is rounded to the next
| - * control step available on the hardware.
| + * hardware. The allocated bandwidth percentage is converted as
| + * appropriate for consumption by the specific hardware driver.
[...]
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 6fb4894b8cfd..5b2a555cf2dd 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> > bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> > int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> >
> > +/*
> > + * Round a bandwidth control value to the nearest value acceptable to
> > + * the arch code for resource r:
> > + */
> > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> > +
>
> I do not think that resctrl should make any assumptions on what the
> architecture's conversion does (i.e "round"). That architecture needs to be
> asked to "round a bandwidth control value" also sounds strange since resctrl really
> should be able to do something like rounding itself. As I understand from
> the notes this will be a no-op for MPAM making this even more confusing.
>
> How about naming the helper something like resctrl_arch_convert_bw()?
> (Open to other ideas of course).
>
> If you make such a change, please check that subject of patch still fits.
I struggled a bit with the name. Really, this is converting the value
to an intermediate form (which might or might not involve rounding).
For historical reasons, this is a value suitable for writing directly
to the relevant x86 MSR without any further interpretation.
For MPAM, it is convenient to do this conversion later, rather than
during parsing of the value.
Would a name like resctrl_arch_preconvert_bw() be acceptable?
This isn't more informative than your suggestion regarding what the
conversion is expected to do, but may convey the expectation that the
output value may still not be in its final (i.e., hardware) form.
> I think that using const to pass data to architecture is great, thanks.
>
> Reinette
I try to constify by default when straightforward to do so, since the
compiler can then find which cases need to change; the reverse
direction is harder to automate...
Cheers
---Dave
[...]
> [1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/
> [2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/
> [3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-02 16:24 [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch Dave Martin
2025-09-12 22:19 ` Reinette Chatre
@ 2025-09-22 15:04 ` Dave Martin
2025-09-25 22:58 ` Luck, Tony
2025-09-26 20:54 ` Reinette Chatre
1 sibling, 2 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-22 15:04 UTC (permalink / raw)
To: linux-kernel
Cc: Tony Luck, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi again,
On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
[...]
> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
>
> Sounds like MPAM may be ready to start the schema parsing discussion again?
> I understand that MPAM has a few more ways to describe memory bandwidth as
> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> schema format to user space, which seems like a good idea for new schema.
On this topic, specifically:
My own ideas in this area are a little different, though I agree with
the general idea.
Bitmap controls are distinct from numeric values, but for numbers, I'm
not sure that distinguishing percentages from other values is required,
since this is really just a specific case of a linear scale.
I imagined a generic numeric schema, described by a set of files like
the following in a schema's info directory:
min: minimum value, e.g., 1
max: maximum value, e.g., 1023
scale: value that corresponds to one unit
unit: quantified base unit, e.g., "100pc", "64MBps"
map: mapping function name
If s is the value written in a schemata entry and p is the
corresponding physical amount of resource, then
min <= s <= max
and
p = map(s / scale) * unit
One reason why I prefer this scaling scheme over the floating-point
approach is that it can be exact (at least for currently known
platforms), and it doesn't require a new floating-point parser/
formatter to be written for this one thing in the kernel (which I
suspect is likely to be error-prone and poorly defined around
subtleties such as rounding behaviour).
"map" anticipates non-linear ramps, but this is only really here as a
forwards compatibility get-out. For now, this might just be set to
"none", meaning the identity mapping (i.e., a no-op). This may shadow
the existing "delay_linear" parameter, but with more general
applicability if we need it.
The idea is that userspace reads the info files and then does the
appropriate conversions itself. This might or might not be seen as a
burden, but would give exact control over the hardware configuration
with a generic interface, with possibly greater precision than the
existing schemata allow (when the hardware supports it), and without
having to second-guess the rounding that the kernel may or may not do
on the values.
For RDT MBA, we might have
min: 10
max: 100
scale: 100
unit: 100pc
map: none
The schemata entry
MB: 0=10, 1=100
would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1.
For AMD SMBA, we might have:
min: 1
max: 100
scale: 8
unit: 1GBps
(if I've understood this correctly from resctrl.rst.)
For MPAM MBW_MAX with, say, 6 bits of resolution, we might have:
min: 1
max: 64
scale: 64
unit: 100pc
map: none
The schemata entry
MB: 0=1,1=64
would allocate the minimum possible bandwidth to domain 0, and 100%
bandwidth to domain 1. This would probably need to be a new schema,
since we already have "MB" mimicking x86.
Exposing the hardware scale in this way would give userspace precise
control (including in sub-1% increments on capable hardware), without
having to second-guess the way the kernel will round the values.
> Is this something MPAM is still considering? For example, the minimum
> and maximum ranges that can be specified, is this something you already
> have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
> discussion on how to handle min/max ranges for bandwidth?
This seems to be a different thing. I think James had some thoughts on
this already -- I haven't checked on his current idea, but one option
would be simply to expose this as two distinct schemata, say MB_MIN,
MB_MAX.
There's a question of how to cope with multiple different schemata
entries that shadow each other (i.e., control the same hardware
resource).
Would something like the following work? A read from schemata might
produce something like this:
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
(Where MB_HW is the MPAM schema with 6-bit resolution that I
illustrated above, and MB_MIN and MB_MAX are similar schemata for the
specific MIN and MAX controls in the hardware.)
Userspace that does not understand the new entries would need to ignore
the commented lines, but can otherwise safely alter and write back the
schemata with the expected results. The kernel would in turn ignore
the commented lines on write. The commented lines are meaningful but
"inactive": they describe the current hardware configuration on read,
but (unless explicitly uncommented) won't change anything on write.
Software that understands the new entries can uncomment the conflicting
entries and write them back instead of (or in addition to) the
conflicting entries. For example, userspace might write the following:
MB_MIN: 0=16, 1=16
MB_MAX: 0=32, 1=32
Which might then read back as follows:
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=16, 1=16
# MB_MAX: 0=32, 1=32
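The write-side handling of the commented entries could be as simple as
a per-line filter (purely illustrative sketch; the real schemata
parser already walks the buffer line by line):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch only: under the proposal above, schemata
 * lines beginning with '#' are descriptive and "inactive", so the
 * kernel would skip them on write.
 */
static bool schemata_line_is_active(const char *line)
{
	/* Skip leading whitespace, then ignore commented entries. */
	while (*line == ' ' || *line == '\t')
		line++;
	return *line != '\0' && *line != '#';
}
```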
I haven't tried to develop this idea further, for now.
I'd be interested in people's thoughts on it, though.
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-22 14:39 ` Dave Martin
@ 2025-09-23 17:27 ` Reinette Chatre
2025-09-25 12:46 ` Dave Martin
2025-10-15 15:18 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-09-23 17:27 UTC (permalink / raw)
To: Dave Martin
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 9/22/25 7:39 AM, Dave Martin wrote:
> Hi Reinette,
>
> Thanks for the review.
>
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> nits:
>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
>> resctrl code (and was established by Arm :)).
>> Also please use upper case for acronym mba->MBA.
>
> Ack (the local custom in the MPAM code is to use "mba", but arguably,
> the meaning is not quite the same -- I'll change it.)
I am curious what the motivation is for the custom? Knowing this will help
me to keep things consistent when the two worlds meet.
>
>> On 9/2/25 9:24 AM, Dave Martin wrote:
>>> The control value parser for the MB resource currently coerces the
>>> memory bandwidth percentage value from userspace to be an exact
>>> multiple of the bw_gran parameter.
>>
>> (to help be specific)
>> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?
>
> "bw_gran" was intended as an informal shorthand for the abstract
> parameter (exposed both in the field you mention and through the
> bandwidth_gran file in resctrl).
I do not see a need for being abstract since the bandwidth_gran file exposes
the field verbatim.
>
> I can rewrite it as per your suggestion, but this could be read as
> excluding the bandwidth_gran file. Would it make sense just to write
> it out longhand? For now, I've rewritten it as follows:
Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran
it is not clear to me how being specific excludes the bandwidth_gran file.
>
> | The control value parser for the MB resource currently coerces the
> | memory bandwidth percentage value from userspace to be an exact
> | multiple of the bandwidth granularity parameter.
If you want to include the bandwidth_gran file then the above could be
something like:
The control value parser for the MB resource coerces the memory
bandwidth percentage value from userspace to be an exact multiple
of the bandwidth granularity parameter that is exposed by the
bandwidth_gran resctrl file.
I still think that replacing "the bandwidth granularity parameter" with
"rdt_resource::resctrl_membw::bw_gran" will help to be more specific.
> |
> | On MPAM systems, this results in somewhat worse-than-worst-case
> | rounding, since the bandwidth granularity advertised to resctrl by the
> | MPAM driver is in general only an approximation [...]
>
> (I'm happy to go with your suggestion if you're not keen on this,
> though.)
>
>>> On MPAM systems, this results in somewhat worse-than-worst-case
>>> rounding, since bw_gran is in general only an approximation to the
>>> actual hardware granularity on these systems, and the hardware
>>> bandwidth allocation control value is not natively a percentage --
>>> necessitating a further conversion in the resctrl_arch_update_domains()
>>> path, regardless of the conversion done at parse time.
>>>
>>> Allow the arch to provide its own parse-time conversion that is
>>> appropriate for the hardware, and move the existing conversion to x86.
>>> This will avoid accumulated error from rounding the value twice on MPAM
>>> systems.
>>>
>>> Clarify the documentation, but avoid overly exact promises.
>>>
>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
>>> code, for now.
>>
>> Sounds like MPAM may be ready to start the schema parsing discussion again?
>> I understand that MPAM has a few more ways to describe memory bandwidth as
>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
>> schema format to user space, which seems like a good idea for new schema.
>
> My own ideas in this area are a little different, though I agree with
> the general idea.
Should we expect a separate proposal from James?
>
> I'll respond separately on that, to avoid this thread getting off-topic.
Much appreciated.
>
> For this patch, my aim was to avoid changing anything unnecessarily.
Understood. More below as I try to understand the details but it does not
really sound as though the current interface works that great for MPAM. If I
understand correctly this patch enables MPAM to use existing interface for
its memory bandwidth allocations but doing so does not enable users to
obtain benefit of hardware capabilities. For that users would want to use
the new interface?
>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
>>> the other tests except for the NONCONT_CAT tests, which do not seem to
>>> be supported in my configuration -- and have nothing to do with the
>>> code touched by this patch).
>>
>> Is the NONCONT_CAT test failing (i.e printing "not ok")?
>>
>> The NONCONT_CAT tests may print error messages as debug information as part of
>> running, but these errors are expected as part of the test. The test should accurately
>> state whether it passed or failed though. For example, below attempts to write
>> a non-contiguous CBM to a system that does not support non-contiguous masks.
>> This fails as expected, error messages printed as debugging and thus the test passes
>> with an "ok".
>>
>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
>> ok 5 L3_NONCONT_CAT: test
>
> I don't think that this was anything to do with my changes, but I don't
> still seem to have the test output. (Since this test has to do with
> bitmap schemata (?), it seemed unlikely to be affected by changes to
> bw_validate().)
I agree that this should not have anything to do with this patch. My concern
is that I understood that the test failed for a feature that is not supported.
If this is the case then there may be a problem with the test. The test should
not fail if the feature is not supported but instead skip the test.
>
>>> Notes:
>>>
>>> I put the x86 version out of line in order to avoid having to move
>>> struct rdt_resource and its dependencies into resctrl_types.h -- which
>>> would create a lot of diff noise. Schemata writes from userspace have
>>> a high overhead in any case.
>>
>> Sounds good, I expect compiler will inline.
>
> The function and caller are in separate translation units, so unless
> LTO is used, I don't think the function will be inlined.
Thanks, yes, indeed.
>
>>>
>>> For MPAM the conversion will be a no-op, because the incoming
>>> percentage from the core resctrl code needs to be converted to hardware
>>> representation in the driver anyway.
>>
>> (addressed below)
>>
>>>
>>> Perhaps _all_ the types should move to resctrl_types.h.
>>
>> Can surely consider when there is a good motivation.
>>
>>>
>>> For now, I went for the smallest diffstat...
>
> I'll assume the motivation is not strong enough for now, but shout if
> you disagree.
I agree.
>
> [...]
>
>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
>>> index c7949dd44f2f..a1d0469d6dfb 100644
>>> --- a/Documentation/filesystems/resctrl.rst
>>> +++ b/Documentation/filesystems/resctrl.rst
>>> @@ -143,12 +143,11 @@ with respect to allocation:
>>> user can request.
>>>
>>> "bandwidth_gran":
>>> - The granularity in which the memory bandwidth
>>> + The approximate granularity in which the memory bandwidth
>>> percentage is allocated. The allocated
>>> b/w percentage is rounded off to the next
>>> - control step available on the hardware. The
>>> - available bandwidth control steps are:
>>> - min_bandwidth + N * bandwidth_gran.
>>> + control step available on the hardware. The available
>>> + steps are at least as small as this value.
>>
>> A bit difficult to parse for me.
>> Is "at least as small as" same as "at least"?
>
> It was supposed to mean: "The available steps are no larger than this
> value."
This is clear to me, especially when compared with the planned addition to
"Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
the paragraph below (more below).
>
> Formally, my expectation is that this value is the smallest integer
> number of percent which is not smaller than the apparent size of any
> individual rounding step. Equivalently, this is the smallest number g
Considering the two statements:
- "The available steps are no larger than this value."
- "this value ... is not smaller than the apparent size of any individual rounding step"
The "not larger" and "not smaller" sounds like all these words just end up saying that
this is the step size?
> for which writing "MB: 0=x" and "MB: 0=y" yield different
> configurations for every in-range x and where y = x + g and y is also
> in-range.
>
> That's a bit of a mouthful, though. If you can think of a more
> succinct way of putting it, I'm open to suggestions!
>
>> Please note that the documentation has a section "Memory bandwidth Allocation
>> and monitoring" that also contains these exact promises.
>
> Hmmm, somehow I completely missed that.
>
> Does the following make sense? Ideally, there would be a simpler way
> to describe the discrepancy between the reported and actual values of
> bw_gran...
>
> | Memory bandwidth Allocation and monitoring
> | ==========================================
> |
> | [...]
> |
> | The minimum bandwidth percentage value for each cpu model is predefined
> | and can be looked up through "info/MB/min_bandwidth". The bandwidth
> | granularity that is allocated is also dependent on the cpu model and can
> | be looked up at "info/MB/bandwidth_gran". The available bandwidth
> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> | -to the next control step available on the hardware.
> | +control steps are: min_bw + N * (bw_gran - e), where e is a
> | +non-negative, hardware-defined real constant that is less than 1.
> | +Intermediate values are rounded to the next control step available on
> | +the hardware.
> | +
> | +At the time of writing, the constant e referred to in the preceding
> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
> | +describes the step size exactly), but this may not be the case on other
> | +hardware when the actual granularity is not an exact divisor of 100.
Have you considered how to share the value of "e" with users?
>>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
>>> index d98e0d2de09f..c5e73b75aaa0 100644
>>> --- a/fs/resctrl/ctrlmondata.c
>>> +++ b/fs/resctrl/ctrlmondata.c
>>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
>>> return false;
>>> }
>>>
>>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran);
>>> + *data = resctrl_arch_round_bw(bw, r);
>>
>> Please check that function comments remain accurate after changes (specifically
>> if making the conversion more generic as proposed below).
>
> I hoped that the comment for this function was still applicable, though
> it can probably be improved. How about the following?
>
> | - * hardware. The allocated bandwidth percentage is rounded to the next
> | - * control step available on the hardware.
> | + * hardware. The allocated bandwidth percentage is converted as
> | + * appropriate for consumption by the specific hardware driver.
>
> [...]
Looks good to me.
>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 6fb4894b8cfd..5b2a555cf2dd 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
>>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
>>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>>>
>>> +/*
>>> + * Round a bandwidth control value to the nearest value acceptable to
>>> + * the arch code for resource r:
>>> + */
>>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
>>> +
>>
>> I do not think that resctrl should make any assumptions on what the
>> architecture's conversion does (i.e "round"). That architecture needs to be
>> asked to "round a bandwidth control value" also sounds strange since resctrl really
>> should be able to do something like rounding itself. As I understand from
>> the notes this will be a no-op for MPAM making this even more confusing.
>>
>> How about naming the helper something like resctrl_arch_convert_bw()?
>> (Open to other ideas of course).
>>
>> If you make such a change, please check that subject of patch still fits.
>
> I struggled a bit with the name. Really, this is converting the value
> to an intermediate form (which might or might not involve rounding).
> For historical reasons, this is a value suitable for writing directly
> to the relevant x86 MSR without any further interpretation.
>
> For MPAM, it is convenient to do this conversion later, rather than
> during parsing of the value.
>
>
> Would a name like resctrl_arch_preconvert_bw() be acceptable?
Yes.
>
> This isn't more informative than your suggestion regarding what the
> conversion is expected to do, but may convey the expectation that the
> output value may still not be in its final (i.e., hardware) form.
Sounds good, yes.
>
>> I think that using const to pass data to architecture is great, thanks.
>>
>> Reinette
>
> I try to constify by default when straightforward to do so, since the
> compiler can then find which cases need to change; the reverse
> direction is harder to automate...
Could you please elaborate what you mean with "reverse direction"?
Thank you
Reinette
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-23 17:27 ` Reinette Chatre
@ 2025-09-25 12:46 ` Dave Martin
2025-09-25 20:53 ` Reinette Chatre
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-09-25 12:46 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/22/25 7:39 AM, Dave Martin wrote:
> > Hi Reinette,
> >
> > Thanks for the review.
> >
> > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> nits:
> >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> >> resctrl code (and was established by Arm :)).
> >> Also please use upper case for acronym mba->MBA.
> >
> > Ack (the local custom in the MPAM code is to use "mba", but arguably,
> > the meaning is not quite the same -- I'll change it.)
>
> I am curious what the motivation is for the custom? Knowing this will help
> me to keep things consistent when the two worlds meet.
I think this has just evolved over time. On the x86 side, MBA is a
specific architectural feature, but on the MPAM side the architecture
doesn't really have a name for the same thing. Memory bandwidth is a
concept, but a few different types of control are defined for it, with
different names.
So, for the MPAM driver "mba" is more of a software concept than
something in a published spec: it's the glue that attaches to "MB"
resource as seen through resctrl.
(This isn't official though; it's just the mental model that I have
formed.)
>
> >>> The control value parser for the MB resource currently coerces the
> >>> memory bandwidth percentage value from userspace to be an exact
> >>> multiple of the bw_gran parameter.
> >>
> >> (to help be specific)
> >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?
> >
> > "bw_gran" was intended as an informal shorthand for the abstract
> > parameter (exposed both in the field you mention and through the
> > bandwidth_gran file in resctrl).
>
> I do not see a need for being abstract since the bandwidth_gran file exposes
> the field verbatim.
Sure; that was just my thought process.
> > I can rewrite it as per your suggestion, but this could be read as
> > excluding the bandwidth_gran file. Would it make sense just to write
> > it out longhand? For now, I've rewritten it as follows:
>
> Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran
> it is not clear to me how being specific excludes the bandwidth_gran file.
>
> >
> > | The control value parser for the MB resource currently coerces the
> > | memory bandwidth percentage value from userspace to be an exact
> > | multiple of the bandwidth granularity parameter.
>
> If you want to include the bandwidth_gran file then the above could be
> something like:
>
> The control value parser for the MB resource coerces the memory
> bandwidth percentage value from userspace to be an exact multiple
> of the bandwidth granularity parameter that is exposed by the
> bandwidth_gran resctrl file.
>
> I still think that replacing "the bandwidth granularity parameter" with
> "rdt_resource::resctrl_membw::bw_gran" will help to be more specific.
That's fine. I'll change as per your original suggestion.
> > |
> > | On MPAM systems, this results in somewhat worse-than-worst-case
> > | rounding, since the bandwidth granularity advertised to resctrl by the
> > | MPAM driver is in general only an approximation [...]
> >
> > (I'm happy to go with your suggestion if you're not keen on this,
> > though.)
> >
> >>> On MPAM systems, this results in somewhat worse-than-worst-case
> >>> rounding, since bw_gran is in general only an approximation to the
> >>> actual hardware granularity on these systems, and the hardware
> >>> bandwidth allocation control value is not natively a percentage --
> >>> necessitating a further conversion in the resctrl_arch_update_domains()
> >>> path, regardless of the conversion done at parse time.
> >>>
> >>> Allow the arch to provide its own parse-time conversion that is
> >>> appropriate for the hardware, and move the existing conversion to x86.
> >>> This will avoid accumulated error from rounding the value twice on MPAM
> >>> systems.
> >>>
> >>> Clarify the documentation, but avoid overly exact promises.
> >>>
> >>> Clamping to bw_min and bw_max still feels generic: leave it in the core
> >>> code, for now.
> >>
> >> Sounds like MPAM may be ready to start the schema parsing discussion again?
> >> I understand that MPAM has a few more ways to describe memory bandwidth as
> >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> >> schema format to user space, which seems like a good idea for new schema.
> >
> > My own ideas in this area are a little different, though I agree with
> > the general idea.
>
> Should we expect a separate proposal from James?
At some point, yes. We still need to have a chat about it.
Right now, I was just throwing an idea out there.
> > I'll respond separately on that, to avoid this thread getting off-topic.
>
> Much appreciated.
>
> >
> > For this patch, my aim was to avoid changing anything unnecessarily.
>
> Understood. More below as I try to understand the details but it does not
> really sound as though the current interface works that great for MPAM. If I
> understand correctly this patch enables MPAM to use existing interface for
> its memory bandwidth allocations but doing so does not enable users to
> obtain benefit of hardware capabilities. For that users would want to use
> the new interface?
In an ideal world, probably, yes.
Since not all use cases will care about full precision, the MB resource
(approximated for MPAM) should be fine for a lot of people, but I
expect that sooner or later somebody will want more exact control.
> >>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> >>> the other tests except for the NONCONT_CAT tests, which do not seem to
> >>> be supported in my configuration -- and have nothing to do with the
> >>> code touched by this patch).
> >>
> >> Is the NONCONT_CAT test failing (i.e printing "not ok")?
> >>
> >> The NONCONT_CAT tests may print error messages as debug information as part of
> >> running, but these errors are expected as part of the test. The test should accurately
> >> state whether it passed or failed though. For example, below attempts to write
> >> a non-contiguous CBM to a system that does not support non-contiguous masks.
> >> This fails as expected, error messages printed as debugging and thus the test passes
> >> with an "ok".
> >>
> >> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
> >> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
> >> ok 5 L3_NONCONT_CAT: test
> >
> > I don't think that this was anything to do with my changes, but I don't
> > still seem to have the test output. (Since this test has to do with
> > bitmap schemata (?), it seemed unlikely to be affected by changes to
> > bw_validate().)
>
> I agree that this should not have anything to do with this patch. My concern
> is that I understood that the test failed for a feature that is not supported.
> If this is the case then there may be a problem with the test. The test should
> not fail if the feature is not supported but instead skip the test.
I'll try to capture more output from this when I re-run it, so that we
can figure out what this is.
> >>> Notes:
> >>>
> >>> I put the x86 version out of line in order to avoid having to move
> >>> struct rdt_resource and its dependencies into resctrl_types.h -- which
> >>> would create a lot of diff noise. Schemata writes from userspace have
> >>> a high overhead in any case.
> >>
> >> Sounds good, I expect compiler will inline.
> >
> > The function and caller are in separate translation units, so unless
> > LTO is used, I don't think the function will be inlined.
>
> Thanks, yes, indeed.
>
> >
> >>>
> >>> For MPAM the conversion will be a no-op, because the incoming
> >>> percentage from the core resctrl code needs to be converted to hardware
> >>> representation in the driver anyway.
> >>
> >> (addressed below)
> >>
> >>>
> >>> Perhaps _all_ the types should move to resctrl_types.h.
> >>
> >> Can surely consider when there is a good motivation.
> >>
> >>>
> >>> For now, I went for the smallest diffstat...
> >
> > I'll assume the motivation is not strong enough for now, but shout if
> > you disagree.
>
> I agree.
OK, I'll leave that as-is for now, then.
> >
> > [...]
> >
> >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> >>> index c7949dd44f2f..a1d0469d6dfb 100644
> >>> --- a/Documentation/filesystems/resctrl.rst
> >>> +++ b/Documentation/filesystems/resctrl.rst
> >>> @@ -143,12 +143,11 @@ with respect to allocation:
> >>> user can request.
> >>>
> >>> "bandwidth_gran":
> >>> - The granularity in which the memory bandwidth
> >>> + The approximate granularity in which the memory bandwidth
> >>> percentage is allocated. The allocated
> >>> b/w percentage is rounded off to the next
> >>> - control step available on the hardware. The
> >>> - available bandwidth control steps are:
> >>> - min_bandwidth + N * bandwidth_gran.
> >>> + control step available on the hardware. The available
> >>> + steps are at least as small as this value.
> >>
> >> A bit difficult to parse for me.
> >> Is "at least as small as" same as "at least"?
> >
> > It was supposed to mean: "The available steps are no larger than this
> > value."
>
> This is clear to me, especially when compared with the planned addition to
> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
> the paragraph below (more below).
>
> >
> > Formally, my expectation is that this value is the smallest integer
> > number of percent which is not smaller than the apparent size of any
> > individual rounding step. Equivalently, this is the smallest number g
>
> Considering the two statements:
> - "The available steps are no larger than this value."
> - "this value ... is not smaller than the apparent size of any individual rounding step"
>
> The "not larger" and "not smaller" sounds like all these words just end up saying that
> this is the step size?
They are intended to be the same statement: A <= B versus
B >= A respectively.
But I'd be the first to admit that the wording is a bit twisted!
(I wouldn't be astonished if I got something wrong somewhere.)
See below for an alternative way of describing this that might be more
intuitive.
>
> > for which writing "MB: 0=x" and "MB: 0=y" yield different
> > configurations for every in-range x and where y = x + g and y is also
> > in-range.
> >
> > That's a bit of a mouthful, though. If you can think of a more
> > succinct way of putting it, I'm open to suggestions!
> >
> >> Please note that the documentation has a section "Memory bandwidth Allocation
> >> and monitoring" that also contains these exact promises.
> >
> > Hmmm, somehow I completely missed that.
> >
> > Does the following make sense? Ideally, there would be a simpler way
> > to describe the discrepancy between the reported and actual values of
> > bw_gran...
> >
> > | Memory bandwidth Allocation and monitoring
> > | ==========================================
> > |
> > | [...]
> > |
> > | The minimum bandwidth percentage value for each cpu model is predefined
> > | and can be looked up through "info/MB/min_bandwidth". The bandwidth
> > | granularity that is allocated is also dependent on the cpu model and can
> > | be looked up at "info/MB/bandwidth_gran". The available bandwidth
> > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> > | -to the next control step available on the hardware.
> > | +control steps are: min_bw + N * (bw_gran - e), where e is a
> > | +non-negative, hardware-defined real constant that is less than 1.
> > | +Intermediate values are rounded to the next control step available on
> > | +the hardware.
> > | +
> > | +At the time of writing, the constant e referred to in the preceding
> > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
> > | +describes the step size exactly), but this may not be the case on other
> > | +hardware when the actual granularity is not an exact divisor of 100.
>
> Have you considered how to share the value of "e" with users?
Perhaps introducing this "e" as an explicit parameter is a bad idea and
overly formal. In practice, there are likely to be various sources of
skid and approximation in the hardware, so exposing an actual value may
be counterproductive -- i.e., what usable guarantee is this providing
to userspace, if this is likely to be swamped by approximations
elsewhere?
Instead, maybe we can just say something like:
| The available steps are spaced at roughly equal intervals between the
| value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
| info/MB/bandwidth_gran gives the worst-case precision of these
| interval steps, in per cent.
What do you think?
If that's adequate, then the wording under the definition of
"bandwidth_gran" could be aligned with this.
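Concretely (hypothetical numbers): taking the 6-bit MBW_MAX example
from earlier in the thread, there are 64 hardware steps, so the true
step size is 100/64 = 1.5625% and bandwidth_gran would report 2; the
reported value then overstates the real step by e = 0.4375. In other
words, the reported granularity is just a ceiling of the true step:

```c
#include <assert.h>

/*
 * Model of how the integer value reported via info/MB/bandwidth_gran
 * relates to the true hardware step size. Hypothetical sketch;
 * nsteps == 64 mimics the 6-bit MPAM MBW_MAX example.
 */
static unsigned int reported_bw_gran(unsigned int nsteps)
{
	/* Smallest integer percentage not below 100 / nsteps. */
	return (100 + nsteps - 1) / nsteps;
}
```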
> >>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> >>> index d98e0d2de09f..c5e73b75aaa0 100644
> >>> --- a/fs/resctrl/ctrlmondata.c
> >>> +++ b/fs/resctrl/ctrlmondata.c
> >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> >>> return false;
> >>> }
> >>>
> >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran);
> >>> + *data = resctrl_arch_round_bw(bw, r);
> >>
> >> Please check that function comments remain accurate after changes (specifically
> >> if making the conversion more generic as proposed below).
> >
> > I hoped that the comment for this function was still applicable, though
> > it can probably be improved. How about the following?
> >
> > | - * hardware. The allocated bandwidth percentage is rounded to the next
> > | - * control step available on the hardware.
> > | + * hardware. The allocated bandwidth percentage is converted as
> > | + * appropriate for consumption by the specific hardware driver.
> >
> > [...]
>
> Looks good to me.
OK.
> >
> >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> >>> index 6fb4894b8cfd..5b2a555cf2dd 100644
> >>> --- a/include/linux/resctrl.h
> >>> +++ b/include/linux/resctrl.h
> >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> >>>
> >>> +/*
> >>> + * Round a bandwidth control value to the nearest value acceptable to
> >>> + * the arch code for resource r:
> >>> + */
> >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> >>> +
> >>
> >> I do not think that resctrl should make any assumptions on what the
> >> architecture's conversion does (i.e "round"). That architecture needs to be
> >> asked to "round a bandwidth control value" also sounds strange since resctrl really
> >> should be able to do something like rounding itself. As I understand from
> >> the notes this will be a no-op for MPAM making this even more confusing.
> >>
> >> How about naming the helper something like resctrl_arch_convert_bw()?
> >> (Open to other ideas of course).
> >>
> >> If you make such a change, please check that subject of patch still fits.
> >
> > I struggled a bit with the name. Really, this is converting the value
> > to an intermediate form (which might or might not involve rounding).
> > For historical reasons, this is a value suitable for writing directly
> > to the relevant x86 MSR without any further interpretation.
> >
> > For MPAM, it is convenient to do this conversion later, rather than
> > during parsing of the value.
> >
> >
> > Would a name like resctrl_arch_preconvert_bw() be acceptable?
>
> Yes.
>
> >
> > This isn't more informative than your suggestion regarding what the
> > conversion is expected to do, but may convey the expectation that the
> > output value may still not be in its final (i.e., hardware) form.
>
> Sounds good, yes.
OK, I'll hack that in.
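For reference, based on the hunk quoted above, the x86 side of the
renamed helper would presumably remain a thin wrapper around the
existing rounding (sketch only, not the final patch):

```c
#include <assert.h>

/*
 * Sketch of the x86 arch hook under its agreed name, modelled on the
 * roundup() call that this patch moves out of bw_validate(). The
 * signature is simplified for illustration; the real helper takes a
 * const struct rdt_resource * and reads membw.bw_gran from it.
 */
static unsigned int resctrl_arch_preconvert_bw(unsigned int val,
					       unsigned int bw_gran)
{
	/* Coerce to the next exact multiple of the granularity. */
	return (val + bw_gran - 1) / bw_gran * bw_gran;
}
```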
>
> >
> >> I think that using const to pass data to architecture is great, thanks.
> >>
> >> Reinette
> >
> > I try to constify by default when straightforward to do so, since the
> > compiler can then find which cases need to change; the reverse
> > direction is harder to automate...
>
> Could you please elaborate what you mean with "reverse direction"?
I just meant that over-consting tends to result in violations of the
language that the compiler will detect, but under-consting doesn't:
static void foo(int *nonconstp, const int *constp)
{
	*constp = 0;    // compiler error
	(*nonconstp);   // silently accepted, though it could have been const
}
So, the compiler will tell you places where const needs to be removed
(or something else needs to change), but to find places where const
could be _added_, you have to hunt them down yourself, or use some
other tool that is probably not part of the usual workflow.
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-25 12:46 ` Dave Martin
@ 2025-09-25 20:53 ` Reinette Chatre
2025-09-25 21:35 ` Luck, Tony
2025-09-29 12:43 ` Dave Martin
0 siblings, 2 replies; 52+ messages in thread
From: Reinette Chatre @ 2025-09-25 20:53 UTC (permalink / raw)
To: Dave Martin
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 9/25/25 5:46 AM, Dave Martin wrote:
> On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
>> On 9/22/25 7:39 AM, Dave Martin wrote:
>>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> nits:
>>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
>>>> resctrl code (and was established by Arm :)).
>>>> Also please use upper case for acronym mba->MBA.
>>>
>>> Ack (the local custom in the MPAM code is to use "mba", but arguably,
>>> the meaning is not quite the same -- I'll change it.)
>>
>> I am curious what the motivation is for the custom? Knowing this will help
>> me to keep things consistent when the two worlds meet.
>
> I think this has just evolved over time. On the x86 side, MBA is a
> specific architectural feature, but on the MPAM side the architecture
> doesn't really have a name for the same thing. Memory bandwidth is a
> concept, but a few different types of control are defined for it, with
> different names.
>
> So, for the MPAM driver "mba" is more of a software concept than
> something in a published spec: it's the glue that attaches to "MB"
> resource as seen through resctrl.
>
> (This isn't official though; it's just the mental model that I have
> formed.)
I see. Thank you for the details. My mental model is simpler: write acronyms
in upper case.
...
>>>>> On MPAM systems, this results in somewhat worse-than-worst-case
>>>>> rounding, since bw_gran is in general only an approximation to the
>>>>> actual hardware granularity on these systems, and the hardware
>>>>> bandwidth allocation control value is not natively a percentage --
>>>>> necessitating a further conversion in the resctrl_arch_update_domains()
>>>>> path, regardless of the conversion done at parse time.
>>>>>
>>>>> Allow the arch to provide its own parse-time conversion that is
>>>>> appropriate for the hardware, and move the existing conversion to x86.
>>>>> This will avoid accumulated error from rounding the value twice on MPAM
>>>>> systems.
>>>>>
>>>>> Clarify the documentation, but avoid overly exact promises.
>>>>>
>>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
>>>>> code, for now.
>>>>
>>>> Sounds like MPAM may be ready to start the schema parsing discussion again?
>>>> I understand that MPAM has a few more ways to describe memory bandwidth as
>>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
>>>> schema format to user space, which seems like a good idea for new schema.
>>>
>>> My own ideas in this area are a little different, though I agree with
>>> the general idea.
>>
>> Should we expect a separate proposal from James?
>
> At some point, yes. We still need to have a chat about it.
>
> Right now, I was just throwing an idea out there.
Thank you very much for doing so. We are digesting it.
...
>>> For this patch, my aim was to avoid changing anything unnecessarily.
>>
>> Understood. More below as I try to understand the details but it does not
>> really sound as though the current interface works that great for MPAM. If I
>> understand correctly this patch enables MPAM to use existing interface for
>> its memory bandwidth allocations but doing so does not enable users to
>> obtain benefit of hardware capabilities. For that users would want to use
>> the new interface?
>
> In an ideal world, probably, yes.
>
> Since not all use cases will care about full precision, the MB resource
> (approximated for MPAM) should be fine for a lot of people, but I
> expect that sooner or later somebody will want more exact control.
ack.
>
>>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
>>>>> the other tests except for the NONCONT_CAT tests, which do not seem to
>>>>> be supported in my configuration -- and have nothing to do with the
>>>>> code touched by this patch).
>>>>
>>>> Is the NONCONT_CAT test failing (i.e printing "not ok")?
>>>>
>>>> The NONCONT_CAT tests may print error messages as debug information as part of
>>>> running, but these errors are expected as part of the test. The test should accurately
>>>> state whether it passed or failed though. For example, below attempts to write
>>>> a non-contiguous CBM to a system that does not support non-contiguous masks.
>>>> This fails as expected, error messages printed as debugging and thus the test passes
>>>> with an "ok".
>>>>
>>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
>>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
>>>> ok 5 L3_NONCONT_CAT: test
>>>
>>> I don't think that this had anything to do with my changes, but I no
>>> longer seem to have the test output. (Since this test has to do with
>>> bitmap schemata (?), it seemed unlikely to be affected by changes to
>>> bw_validate().)
>>
>> I agree that this should not have anything to do with this patch. My concern
>> is that I understood that the test failed for a feature that is not supported.
>> If this is the case then there may be a problem with the test. The test should
>> not fail if the feature is not supported but instead skip the test.
>
> I'll try to capture more output from this when I re-run it, so that we
> can figure out what this is.
Thank you.
...
>>>
>>> [...]
>>>
>>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
>>>>> index c7949dd44f2f..a1d0469d6dfb 100644
>>>>> --- a/Documentation/filesystems/resctrl.rst
>>>>> +++ b/Documentation/filesystems/resctrl.rst
>>>>> @@ -143,12 +143,11 @@ with respect to allocation:
>>>>> user can request.
>>>>>
>>>>> "bandwidth_gran":
>>>>> - The granularity in which the memory bandwidth
>>>>> + The approximate granularity in which the memory bandwidth
>>>>> percentage is allocated. The allocated
>>>>> b/w percentage is rounded off to the next
>>>>> - control step available on the hardware. The
>>>>> - available bandwidth control steps are:
>>>>> - min_bandwidth + N * bandwidth_gran.
>>>>> + control step available on the hardware. The available
>>>>> + steps are at least as small as this value.
>>>>
>>>> A bit difficult to parse for me.
>>>> Is "at least as small as" same as "at least"?
>>>
>>> It was supposed to mean: "The available steps are no larger than this
>>> value."
>>
>> This is clear to me, especially when compared with the planned addition to
>> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
>> the paragraph below (more below).
>>
>>>
>>> Formally, my expectation is that this value is the smallest integer
>>> number of percent which is not smaller than the apparent size of any
>>> individual rounding step. Equivalently, this is the smallest number g
>>
>> Considering the two statements:
>> - "The available steps are no larger than this value."
>> - "this value ... is not smaller than the apparent size of any individual rounding step"
>>
>> The "not larger" and "not smaller" sounds like all these words just end up saying that
>> this is the step size?
>
> They are intended to be the same statement: A <= B versus
> B >= A respectively.
This is what I understood from the words ... and that made me think that it
can be simplified to A = B ... but no need to digress ... onto the alternatives below ...
>
> But I'd be the first to admit that the wording is a bit twisted!
> (I wouldn't be astonished if I got something wrong somewhere.)
>
> See below for an alternative way of describing this that might be more
> intuitive.
>
>>
>>> for which writing "MB: 0=x" and "MB: 0=y" yield different
>>> configurations for every in-range x and where y = x + g and y is also
>>> in-range.
>>>
>>> That's a bit of a mouthful, though. If you can think of a more
>>> succinct way of putting it, I'm open to suggestions!
>>>
>>>> Please note that the documentation has a section "Memory bandwidth Allocation
>>>> and monitoring" that also contains these exact promises.
>>>
>>> Hmmm, somehow I completely missed that.
>>>
>>> Does the following make sense? Ideally, there would be a simpler way
>>> to describe the discrepancy between the reported and actual values of
>>> bw_gran...
>>>
>>> | Memory bandwidth Allocation and monitoring
>>> | ==========================================
>>> |
>>> | [...]
>>> |
>>> | The minimum bandwidth percentage value for each cpu model is predefined
>>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth
>>> | granularity that is allocated is also dependent on the cpu model and can
>>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth
>>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
>>> | -to the next control step available on the hardware.
>>> | +control steps are: min_bw + N * (bw_gran - e), where e is a
>>> | +non-negative, hardware-defined real constant that is less than 1.
>>> | +Intermediate values are rounded to the next control step available on
>>> | +the hardware.
>>> | +
>>> | +At the time of writing, the constant e referred to in the preceding
>>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
>>> | +describes the step size exactly), but this may not be the case on other
>>> | +hardware when the actual granularity is not an exact divisor of 100.
>>
>> Have you considered how to share the value of "e" with users?
>
> Perhaps introducing this "e" as an explicit parameter is a bad idea and
> overly formal. In practice, there are likely to be various sources of
> skid and approximation in the hardware, so exposing an actual value may
> be counterproductive -- i.e., what usable guarantee is this providing
> to userspace, if this is likely to be swamped by approximations
> elsewhere?
>
> Instead, maybe we can just say something like:
>
> | The available steps are spaced at roughly equal intervals between the
> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
> | info/MB/bandwidth_gran gives the worst-case precision of these
> | interval steps, in per cent.
>
> What do you think?
I find "worst-case precision" a bit confusing, consider for example, what
would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
the upper limit of these interval steps"? I believe this matches what you
mentioned a couple of messages ago: "The available steps are no larger than this
value."
(and "per cent" -> "percent")
>
> If that's adequate, then the wording under the definition of
> "bandwidth_gran" could be aligned with this.
I think putting together a couple of your proposals and statements while making the
text more accurate may work:
"bandwidth_gran":
The approximate granularity in which the memory bandwidth
percentage is allocated. The allocated bandwidth percentage
is rounded up to the next control step available on the
hardware. The available hardware steps are no larger than
this value.
I assume "available" is needed because, even though the steps are not larger
than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
to 100% range?
>>>> I think that using const to pass data to architecture is great, thanks.
>>>>
>>>> Reinette
>>>
>>> I try to constify by default when straightforward to do so, since the
>>> compiler can then find which cases need to change; the reverse
>>> direction is harder to automate...
>>
>> Could you please elaborate what you mean with "reverse direction"?
>
> I just meant that over-consting tends to result in violations of the
> language that the compiler will detect, but under-consting doesn't:
>
> static void foo(int *nonconstp, const int *constp)
> {
> *constp = 0; // compiler error
> (*nonconstp); // silently accepted, though it could have been const
> }
>
> So, the compiler will tell you places where const needs to be removed
> (or something else needs to change), but to find places where const
> could be _added_, you have to hunt them down yourself, or use some
> other tool that is probably not part of the usual workflow.
Got it, thanks.
Reinette
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-25 20:53 ` Reinette Chatre
@ 2025-09-25 21:35 ` Luck, Tony
2025-09-25 22:18 ` Reinette Chatre
2025-09-29 12:43 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2025-09-25 21:35 UTC (permalink / raw)
To: Reinette Chatre
Cc: Dave Martin, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/25/25 5:46 AM, Dave Martin wrote:
> > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
> >> On 9/22/25 7:39 AM, Dave Martin wrote:
> >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> >>>> Hi Dave,
> >>>>
> >>>> nits:
> >>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> >>>> resctrl code (and was established by Arm :)).
> >>>> Also please use upper case for acronym mba->MBA.
> >>>
> >>> Ack (the local custom in the MPAM code is to use "mba", but arguably,
> >>> the meaning is not quite the same -- I'll change it.)
> >>
> >> I am curious what the motivation is for the custom? Knowing this will help
> >> me to keep things consistent when the two worlds meet.
> >
> > I think this has just evolved over time. On the x86 side, MBA is a
> > specific architectural feature, but on the MPAM side the architecture
> > doesn't really have a name for the same thing. Memory bandwidth is a
> > concept, but a few different types of control are defined for it, with
> > different names.
> >
> > So, for the MPAM driver "mba" is more of a software concept than
> > something in a published spec: it's the glue that attaches to "MB"
> > resource as seen through resctrl.
> >
> > (This isn't official though; it's just the mental model that I have
> > formed.)
>
> I see. Thank you for the details. My mental model is simpler: write acronyms
> in upper case.
>
> ...
>
> >>>>> On MPAM systems, this results in somewhat worse-than-worst-case
> >>>>> rounding, since bw_gran is in general only an approximation to the
> >>>>> actual hardware granularity on these systems, and the hardware
> >>>>> bandwidth allocation control value is not natively a percentage --
> >>>>> necessitating a further conversion in the resctrl_arch_update_domains()
> >>>>> path, regardless of the conversion done at parse time.
> >>>>>
> >>>>> Allow the arch to provide its own parse-time conversion that is
> >>>>> appropriate for the hardware, and move the existing conversion to x86.
> >>>>> This will avoid accumulated error from rounding the value twice on MPAM
> >>>>> systems.
> >>>>>
> >>>>> Clarify the documentation, but avoid overly exact promises.
> >>>>>
> >>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
> >>>>> code, for now.
> >>>>
> >>>> Sounds like MPAM may be ready to start the schema parsing discussion again?
> >>>> I understand that MPAM has a few more ways to describe memory bandwidth as
> >>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> >>>> schema format to user space, which seems like a good idea for new schema.
> >>>
> >>> My own ideas in this area are a little different, though I agree with
> >>> the general idea.
> >>
> >> Should we expect a separate proposal from James?
> >
> > At some point, yes. We still need to have a chat about it.
> >
> > Right now, I was just throwing an idea out there.
>
> Thank you very much for doing so. We are digesting it.
>
>
> ...
>
> >>> For this patch, my aim was to avoid changing anything unnecessarily.
> >>
> >> Understood. More below as I try to understand the details but it does not
> >> really sound as though the current interface works that great for MPAM. If I
> >> understand correctly this patch enables MPAM to use existing interface for
> >> its memory bandwidth allocations but doing so does not enable users to
> >> obtain benefit of hardware capabilities. For that users would want to use
> >> the new interface?
> >
> > In an ideal world, probably, yes.
> >
> > Since not all use cases will care about full precision, the MB resource
> > (approximated for MPAM) should be fine for a lot of people, but I
> > expect that sooner or later somebody will want more exact control.
>
> ack.
>
> >
> >>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> >>>>> the other tests except for the NONCONT_CAT tests, which do not seem to
> >>>>> be supported in my configuration -- and have nothing to do with the
> >>>>> code touched by this patch).
> >>>>
> >>>> Is the NONCONT_CAT test failing (i.e printing "not ok")?
> >>>>
> >>>> The NONCONT_CAT tests may print error messages as debug information as part of
> >>>> running, but these errors are expected as part of the test. The test should accurately
> >>>> state whether it passed or failed though. For example, below attempts to write
> >>>> a non-contiguous CBM to a system that does not support non-contiguous masks.
> >>>> This fails as expected, error messages printed as debugging and thus the test passes
> >>>> with an "ok".
> >>>>
> >>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
> >>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
> >>>> ok 5 L3_NONCONT_CAT: test
> >>>
> >>> I don't think that this had anything to do with my changes, but I no
> >>> longer seem to have the test output. (Since this test has to do with
> >>> bitmap schemata (?), it seemed unlikely to be affected by changes to
> >>> bw_validate().)
> >>
> >> I agree that this should not have anything to do with this patch. My concern
> >> is that I understood that the test failed for a feature that is not supported.
> >> If this is the case then there may be a problem with the test. The test should
> >> not fail if the feature is not supported but instead skip the test.
> >
> > I'll try to capture more output from this when I re-run it, so that we
> > can figure out what this is.
>
> Thank you.
>
>
> ...
>
> >>>
> >>> [...]
> >>>
> >>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> >>>>> index c7949dd44f2f..a1d0469d6dfb 100644
> >>>>> --- a/Documentation/filesystems/resctrl.rst
> >>>>> +++ b/Documentation/filesystems/resctrl.rst
> >>>>> @@ -143,12 +143,11 @@ with respect to allocation:
> >>>>> user can request.
> >>>>>
> >>>>> "bandwidth_gran":
> >>>>> - The granularity in which the memory bandwidth
> >>>>> + The approximate granularity in which the memory bandwidth
> >>>>> percentage is allocated. The allocated
> >>>>> b/w percentage is rounded off to the next
> >>>>> - control step available on the hardware. The
> >>>>> - available bandwidth control steps are:
> >>>>> - min_bandwidth + N * bandwidth_gran.
> >>>>> + control step available on the hardware. The available
> >>>>> + steps are at least as small as this value.
> >>>>
> >>>> A bit difficult to parse for me.
> >>>> Is "at least as small as" same as "at least"?
> >>>
> >>> It was supposed to mean: "The available steps are no larger than this
> >>> value."
> >>
> >> This is clear to me, especially when compared with the planned addition to
> >> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting
> >> the paragraph below (more below).
> >>
> >>>
> >>> Formally, my expectation is that this value is the smallest integer
> >>> number of percent which is not smaller than the apparent size of any
> >>> individual rounding step. Equivalently, this is the smallest number g
> >>
> >> Considering the two statements:
> >> - "The available steps are no larger than this value."
> >> - "this value ... is not smaller than the apparent size of any individual rounding step"
> >>
> >> The "not larger" and "not smaller" sounds like all these words just end up saying that
> >> this is the step size?
> >
> > They are intended to be the same statement: A <= B versus
> > B >= A respectively.
>
> This is what I understood from the words ... and that made me think that it
> can be simplified to A = B ... but no need to digress ... onto the alternatives below ...
>
> >
> > But I'd be the first to admit that the wording is a bit twisted!
> > (I wouldn't be astonished if I got something wrong somewhere.)
> >
> > See below for an alternative way of describing this that might be more
> > intuitive.
> >
> >>
> >>> for which writing "MB: 0=x" and "MB: 0=y" yield different
> >>> configurations for every in-range x and where y = x + g and y is also
> >>> in-range.
> >>>
> >>> That's a bit of a mouthful, though. If you can think of a more
> >>> succinct way of putting it, I'm open to suggestions!
> >>>
> >>>> Please note that the documentation has a section "Memory bandwidth Allocation
> >>>> and monitoring" that also contains these exact promises.
> >>>
> >>> Hmmm, somehow I completely missed that.
> >>>
> >>> Does the following make sense? Ideally, there would be a simpler way
> >>> to describe the discrepancy between the reported and actual values of
> >>> bw_gran...
> >>>
> >>> | Memory bandwidth Allocation and monitoring
> >>> | ==========================================
> >>> |
> >>> | [...]
> >>> |
> >>> | The minimum bandwidth percentage value for each cpu model is predefined
> >>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth
> >>> | granularity that is allocated is also dependent on the cpu model and can
> >>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth
> >>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> >>> | -to the next control step available on the hardware.
> >>> | +control steps are: min_bw + N * (bw_gran - e), where e is a
> >>> | +non-negative, hardware-defined real constant that is less than 1.
> >>> | +Intermediate values are rounded to the next control step available on
> >>> | +the hardware.
> >>> | +
> >>> | +At the time of writing, the constant e referred to in the preceding
> >>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
> >>> | +describes the step size exactly), but this may not be the case on other
> >>> | +hardware when the actual granularity is not an exact divisor of 100.
> >>
> >> Have you considered how to share the value of "e" with users?
> >
> > Perhaps introducing this "e" as an explicit parameter is a bad idea and
> > overly formal. In practice, there are likely to various sources of
> > skid and approximation in the hardware, so exposing an actual value may
> > be counterproductive -- i.e., what usable guarantee is this providing
> > to userspace, if this is likely to be swamped by approximations
> > elsewhere?
> >
> > Instead, maybe we can just say something like:
> >
> > | The available steps are spaced at roughly equal intervals between the
> > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
> > | info/MB/bandwidth_gran gives the worst-case precision of these
> > | interval steps, in per cent.
> >
> > What do you think?
>
> I find "worst-case precision" a bit confusing, consider for example, what
> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
> the upper limit of these interval steps"? I believe this matches what you
> mentioned a couple of messages ago: "The available steps are no larger than this
> value."
>
> (and "per cent" -> "percent")
>
> >
> > If that's adequate, then the wording under the definition of
> > "bandwidth_gran" could be aligned with this.
>
> I think putting together a couple of your proposals and statements while making the
> text more accurate may work:
>
> "bandwidth_gran":
> The approximate granularity in which the memory bandwidth
> percentage is allocated. The allocated bandwidth percentage
> is rounded up to the next control step available on the
> hardware. The available hardware steps are no larger than
> this value.
>
> I assume "available" is needed because, even though the steps are not larger
> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
> to 100% range?
What values are allowed for "bandwidth_gran"? The "Intel® Resource
Director Technology (Intel® RDT) Architecture Specification"
https://cdrdv2.intel.com/v1/dl/getContent/789566
describes the upcoming region aware memory bandwidth allocation
controls as being a number from "1" to "Q" (enumerated in an ACPI
table). First implementation looks like Q == 255 which means a
granularity of 0.392%. The spec has headroom to allow Q == 511.
I don't expect users to need that granularity at the high bandwidth
end of the range, but I do expect them to care for highly throttled
background/batch jobs to make sure they can't affect performance of
the high priority jobs.
I'd hate to have to round all low bandwidth controls to 1% steps.
>
> >>>> I think that using const to pass data to architecture is great, thanks.
> >>>>
> >>>> Reinette
> >>>
> >>> I try to constify by default when straightforward to do so, since the
> >>> compiler can then find which cases need to change; the reverse
> >>> direction is harder to automate...
> >>
> >> Could you please elaborate what you mean with "reverse direction"?
> >
> > I just meant that over-consting tends to result in violations of the
> > language that the compiler will detect, but under-consting doesn't:
> >
> > static void foo(int *nonconstp, const int *constp)
> > {
> > *constp = 0; // compiler error
> > (*nonconstp); // silently accepted, though it could have been const
> > }
> >
> > So, the compiler will tell you places where const needs to be removed
> > (or something else needs to change), but to find places where const
> > could be _added_, you have to hunt them down yourself, or use some
> > other tool that is probably not part of the usual workflow.
>
> Got it, thanks.
>
> Reinette
>
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-25 21:35 ` Luck, Tony
@ 2025-09-25 22:18 ` Reinette Chatre
2025-09-29 13:08 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-09-25 22:18 UTC (permalink / raw)
To: Luck, Tony
Cc: Dave Martin, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Tony,
On 9/25/25 2:35 PM, Luck, Tony wrote:
> On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote:
>> On 9/25/25 5:46 AM, Dave Martin wrote:
>>> On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
>>>> On 9/22/25 7:39 AM, Dave Martin wrote:
>>>>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
...
>>>>> for which writing "MB: 0=x" and "MB: 0=y" yield different
>>>>> configurations for every in-range x and where y = x + g and y is also
>>>>> in-range.
>>>>>
>>>>> That's a bit of a mouthful, though. If you can think of a more
>>>>> succinct way of putting it, I'm open to suggestions!
>>>>>
>>>>>> Please note that the documentation has a section "Memory bandwidth Allocation
>>>>>> and monitoring" that also contains these exact promises.
>>>>>
>>>>> Hmmm, somehow I completely missed that.
>>>>>
>>>>> Does the following make sense? Ideally, there would be a simpler way
>>>>> to describe the discrepancy between the reported and actual values of
>>>>> bw_gran...
>>>>>
>>>>> | Memory bandwidth Allocation and monitoring
>>>>> | ==========================================
>>>>> |
>>>>> | [...]
>>>>> |
>>>>> | The minimum bandwidth percentage value for each cpu model is predefined
>>>>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth
>>>>> | granularity that is allocated is also dependent on the cpu model and can
>>>>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth
>>>>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
>>>>> | -to the next control step available on the hardware.
>>>>> | +control steps are: min_bw + N * (bw_gran - e), where e is a
>>>>> | +non-negative, hardware-defined real constant that is less than 1.
>>>>> | +Intermediate values are rounded to the next control step available on
>>>>> | +the hardware.
>>>>> | +
>>>>> | +At the time of writing, the constant e referred to in the preceding
>>>>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
>>>>> | +describes the step size exactly), but this may not be the case on other
>>>>> | +hardware when the actual granularity is not an exact divisor of 100.
>>>>
>>>> Have you considered how to share the value of "e" with users?
>>>
>>> Perhaps introducing this "e" as an explicit parameter is a bad idea and
> > overly formal. In practice, there are likely to be various sources of
>>> skid and approximation in the hardware, so exposing an actual value may
>>> be counterproductive -- i.e., what usable guarantee is this providing
>>> to userspace, if this is likely to be swamped by approximations
>>> elsewhere?
>>>
>>> Instead, maybe we can just say something like:
>>>
>>> | The available steps are spaced at roughly equal intervals between the
>>> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
>>> | info/MB/bandwidth_gran gives the worst-case precision of these
>>> | interval steps, in per cent.
>>>
>>> What do you think?
>>
>> I find "worst-case precision" a bit confusing, consider for example, what
>> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
>> the upper limit of these interval steps"? I believe this matches what you
>> mentioned a couple of messages ago: "The available steps are no larger than this
>> value."
>>
>> (and "per cent" -> "percent")
>>
>>>
>>> If that's adequate, then the wording under the definition of
>>> "bandwidth_gran" could be aligned with this.
>>
>> I think putting together a couple of your proposals and statements while making the
>> text more accurate may work:
>>
>> "bandwidth_gran":
>> The approximate granularity in which the memory bandwidth
>> percentage is allocated. The allocated bandwidth percentage
>> is rounded up to the next control step available on the
>> hardware. The available hardware steps are no larger than
>> this value.
>>
>> I assume "available" is needed because, even though the steps are not larger
>> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
>> to 100% range?
>
> What values are allowed for "bandwidth_gran"? The "Intel® Resource
This is a property of the MB resource where the ABI is to express allocations
as a percentage. Current doc:
"bandwidth_gran":
The granularity in which the memory bandwidth
percentage is allocated. The allocated
b/w percentage is rounded off to the next
control step available on the hardware. The
available bandwidth control steps are:
min_bandwidth + N * bandwidth_gran.
I do not expect we can switch it to fractions so I would say that
integer values are allowed, starting at 1.
I understand that the MB resource on AMD supports different ranges and
I find that ABI discrepancy unfortunate. I do not think this should be
seen as an opportunity that "anything goes" when it comes to MB and used as
an excuse to pile on another range of hardware dependent inputs. Instead I
believe we should keep MB interface as-is and instead work on a generic
interface that enables user space to interact with resctrl to have benefit
of all hardware capabilities without needing to know which hardware is
underneath.
> Director Technology (Intel® RDT) Architecture Specification"
>
> https://cdrdv2.intel.com/v1/dl/getContent/789566
>
> describes the upcoming region aware memory bandwidth allocation
> controls as being a number from "1" to "Q" (enumerated in an ACPI
> table). First implementation looks like Q == 255 which means a
> granularity of 0.392%. The spec has headroom to allow Q == 511.
>
> I don't expect users to need that granularity at the high bandwidth
> end of the range, but I do expect them to care for highly throttled
> background/batch jobs to make sure they can't affect performance of
> the high priority jobs.
>
> I'd hate to have to round all low bandwidth controls to 1% steps.
This is the limitation of choosing to expose this feature as an MB resource
and seems to be the same problem that Dave is facing. For finer granularity
allocations I expect that we would need a new schema/resource backed by new
properties as proposed by Dave in
https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
This will require updates to user space (that will anyway be needed if wedging
another non-ABI input into MB).
Reinette
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-22 15:04 ` Dave Martin
@ 2025-09-25 22:58 ` Luck, Tony
2025-09-29 9:19 ` Chen, Yu C
2025-09-29 13:56 ` Dave Martin
2025-09-26 20:54 ` Reinette Chatre
1 sibling, 2 replies; 52+ messages in thread
From: Luck, Tony @ 2025-09-25 22:58 UTC (permalink / raw)
To: Dave Martin
Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
> Hi again,
>
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>
> [...]
>
> > > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > > code, for now.
> >
> > Sounds like MPAM may be ready to start the schema parsing discussion again?
> > I understand that MPAM has a few more ways to describe memory bandwidth as
> > well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
> > schema format to user space, which seems like a good idea for new schema.
>
> On this topic, specifically:
>
>
> My own ideas in this area are a little different, though I agree with
> the general idea.
>
> Bitmap controls are distinct from numeric values, but for numbers, I'm
> not sure that distinguishing percentages from other values is required,
> since this is really just a specific case of a linear scale.
>
> I imagined a generic numeric schema, described by a set of files like
> the following in a schema's info directory:
>
> min: minimum value, e.g., 1
> max: maximum value, e.g., 1023
> scale: value that corresponds to one unit
> unit: quantified base unit, e.g., "100pc", "64MBps"
> map: mapping function name
>
> If s is the value written in a schemata entry and p is the
> corresponding physical amount of resource, then
>
> min <= s <= max
>
> and
>
> p = map(s / scale) * unit
>
> One reason why I prefer this scaling scheme over the floating-point
> approach is that it can be exact (at least for currently known
> platforms), and it doesn't require a new floating-point parser/
> formatter to be written for this one thing in the kernel (which I
> suspect is likely to be error-prone and poorly defined around
> subtleties such as rounding behaviour).
>
> "map" anticipates non-linear ramps, but this is only really here as a
> forwards compatibility get-out. For now, this might just be set to
> "none", meaning the identity mapping (i.e., a no-op). This may shadow
> the existing "delay_linear" parameter, but with more general
> applicability if we need it.
>
>
> The idea is that userspace reads the info files and then does the
> appropriate conversions itself. This might or might not be seen as a
> burden, but would give exact control over the hardware configuration
> with a generic interface, with possibly greater precision than the
> existing schemata allow (when the hardware supports it), and without
> having to second-guess the rounding that the kernel may or may not do
> on the values.
>
> For RDT MBA, we might have
>
> min: 10
> max: 100
> scale: 100
> unit: 100pc
> map: none
>
> The schemata entry
>
> MB: 0=10, 1=100
>
> would allocate the minimum possible bandwidth to domain 0, and 100%
> bandwidth to domain 1.
>
>
> For AMD SMBA, we might have:
>
> min: 1
> max: 100
> scale: 8
> unit: 1GBps
>
> (if I've understood this correctly from resctrl.rst.)
>
>
> For MPAM MBW_MAX with, say, 6 bits of resolution, we might have:
>
> min: 1
> max: 64
> scale: 64
> unit: 100pc
> map: none
>
> The schemata entry
>
> MB: 0=1,1=64
>
> would allocate the minimum possible bandwidth to domain 0, and 100%
> bandwidth to domain 1. This would probably need to be a new schema,
> since we already have "MB" mimicking x86.
>
> Exposing the hardware scale in this way would give userspace precise
> control (including in sub-1% increments on capable hardware), without
> having to second-guess the way the kernel will round the values.
>
>
> > Is this something MPAM is still considering? For example, the minimum
> > and maximum ranges that can be specified, is this something you already
> > have some ideas for? Have you perhaps considered Tony's RFD [3] that includes
> > discussion on how to handle min/max ranges for bandwidth?
>
> This seems to be a different thing. I think James had some thoughts on
> this already -- I haven't checked on his current idea, but one option
> would be simply to expose this as two distinct schemata, say MB_MIN,
> MB_MAX.
>
> There's a question of how to cope with multiple different schemata
> entries that shadow each other (i.e., control the same hardware
> resource).
>
>
> Would something like the following work? A read from schemata might
> produce something like this:
>
> MB: 0=50, 1=50
> # MB_HW: 0=32, 1=32
> # MB_MIN: 0=31, 1=31
> # MB_MAX: 0=32, 1=32
>
> (Where MB_HW is the MPAM schema with 6-bit resolution that I
> illustrated above, and MB_MIN and MB_MAX are similar schemata for the
> specific MIN and MAX controls in the hardware.)
>
> Userspace that does not understand the new entries would need to ignore
> the commented lines, but can otherwise safely alter and write back the
> schemata with the expected results. The kernel would in turn ignore
> the commented lines on write. The commented lines are meaningful but
> "inactive": they describe the current hardware configuration on read,
> but (unless explicitly uncommented) won't change anything on write.
>
> Software that understands the new entries can uncomment the conflicting
> entries and write them back instead of (or in addition to) the
> conflicting entries. For example, userspace might write the following:
>
> MB_MIN: 0=16, 1=16
> MB_MAX: 0=32, 1=32
>
> Which might then read back as follows:
>
> MB: 0=50, 1=50
> # MB_HW: 0=32, 1=32
> # MB_MIN: 0=16, 1=16
> # MB_MAX: 0=32, 1=32
>
>
> I haven't tried to develop this idea further, for now.
>
> I'd be interested in people's thoughts on it, though.
Applying this to Intel's upcoming region-aware memory bandwidth,
which supports 255 steps and h/w min/max limits:
We would have info files with "min = 1, max = 255" and a schemata
file that looks like this to legacy apps:
MB: 0=50;1=75
#MB_HW: 0=128;1=191
#MB_MIN: 0=128;1=191
#MB_MAX: 0=128;1=191
But a newer app that is aware of the extensions can write:
# cat > schemata << 'EOF'
MB_HW: 0=10
MB_MIN: 0=10
MB_MAX: 0=64
EOF
which then reads back as:
MB: 0=4;1=75
#MB_HW: 0=10;1=191
#MB_MIN: 0=10;1=191
#MB_MAX: 0=64;1=191
with the legacy line updated with the rounded value of the MB_HW
supplied by the user. 10/255 = 3.921% ... so call it "4".
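As a sketch of that round-trip (my own illustration; hw_max = 255 is the
first-implementation "Q" and would really be enumerated from the ACPI
table, so treat it as an assumption):

```python
def hw_to_legacy_pct(hw_val, hw_max=255):
    """Map a hardware control step onto the legacy MB percentage line,
    rounding to the nearest whole percent."""
    return round(100 * hw_val / hw_max)

print(hw_to_legacy_pct(10))   # 10/255 = 3.921% -> 4
print(hw_to_legacy_pct(128))  # 128/255 = 50.196% -> 50
print(hw_to_legacy_pct(191))  # 191/255 = 74.902% -> 75
```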
The region aware h/w supports separate bandwidth controls for each
region. We could hope (or perhaps update the spec to define) that
region0 is always node-local DDR memory and keep the "MB" tag for
that.
Then use some other tag naming for other regions. Remote DDR,
local CXL, remote CXL are the ones we think are next in the h/w
memory sequence. But the "region" concept would allow for other
options as other memory technologies come into use.
>
> Cheers
> ---Dave
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-22 15:04 ` Dave Martin
2025-09-25 22:58 ` Luck, Tony
@ 2025-09-26 20:54 ` Reinette Chatre
2025-09-29 13:40 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-09-26 20:54 UTC (permalink / raw)
To: Dave Martin, linux-kernel
Cc: Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86, linux-doc
Hi Dave,
Just one correction ...
On 9/22/25 8:04 AM, Dave Martin wrote:
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>
> [...]
>
>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
>>> code, for now.
>>
>> Sounds like MPAM may be ready to start the schema parsing discussion again?
>> I understand that MPAM has a few more ways to describe memory bandwidth as
>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing
>> schema format to user space, which seems like a good idea for new schema.
>
> On this topic, specifically:
>
>
> My own ideas in this area are a little different, though I agree with
> the general idea.
>
> Bitmap controls are distinct from numeric values, but for numbers, I'm
> not sure that distinguishing percentages from other values is required,
> since this is really just a specific case of a linear scale.
>
> I imagined a generic numeric schema, described by a set of files like
> the following in a schema's info directory:
>
> min: minimum value, e.g., 1
> max: maximum value, e.g., 1023
> scale: value that corresponds to one unit
> unit: quantified base unit, e.g., "100pc", "64MBps"
> map: mapping function name
>
> If s is the value written in a schemata entry and p is the
> corresponding physical amount of resource, then
>
> min <= s <= max
>
> and
>
> p = map(s / scale) * unit
>
> One reason why I prefer this scaling scheme over the floating-point
> approach is that it can be exact (at least for currently known
> platforms), and it doesn't require a new floating-point parser/
> formatter to be written for this one thing in the kernel (which I
> suspect is likely to be error-prone and poorly defined around
> subtleties such as rounding behaviour).
>
> "map" anticipates non-linear ramps, but this is only really here as a
> forwards compatibility get-out. For now, this might just be set to
> "none", meaning the identity mapping (i.e., a no-op). This may shadow
> the existing "delay_linear" parameter, but with more general
> applicability if we need it.
>
>
> The idea is that userspace reads the info files and then does the
> appropriate conversions itself. This might or might not be seen as a
> burden, but would give exact control over the hardware configuration
> with a generic interface, with possibly greater precision than the
> existing schemata allow (when the hardware supports it), and without
> having to second-guess the rounding that the kernel may or may not do
> on the values.
>
> For RDT MBA, we might have
>
> min: 10
> max: 100
> scale: 100
> unit: 100pc
> map: none
>
> The schemata entry
>
> MB: 0=10, 1=100
>
> would allocate the minimum possible bandwidth to domain 0, and 100%
> bandwidth to domain 1.
>
>
> For AMD SMBA, we might have:
>
> min: 1
> max: 100
> scale: 8
> unit: 1GBps
>
Unfortunately not like this for AMD. Initial support for AMD MBA set max
to a hardcoded 2048 [1] that was later [2] modified to learn max from hardware.
Of course this broke resctrl as a generic interface and I hope we learned
enough since to not repeat this mistake nor give up on MB and make its interface
even worse by, for example, adding more architecture specific input ranges.
Reinette
[1] commit 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
[2] commit 0976783bb123 ("x86/resctrl: Remove hard-coded memory bandwidth limit")
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-25 22:58 ` Luck, Tony
@ 2025-09-29 9:19 ` Chen, Yu C
2025-09-29 14:13 ` Dave Martin
2025-09-29 13:56 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Chen, Yu C @ 2025-09-29 9:19 UTC (permalink / raw)
To: Luck, Tony
Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc, Dave Martin
On 9/26/2025 6:58 AM, Luck, Tony wrote:
> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
>> Hi again,
>>
>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>>
>> [...]
>>
[snip]
>> For example, userspace might write the following:
>>
>> MB_MIN: 0=16, 1=16
>> MB_MAX: 0=32, 1=32
>>
>> Which might then read back as follows:
>>
>> MB: 0=50, 1=50
>> # MB_HW: 0=32, 1=32
>> # MB_MIN: 0=16, 1=16
>> # MB_MAX: 0=32, 1=32
>>
>>
>> I haven't tried to develop this idea further, for now.
>>
>> I'd be interested in people's thoughts on it, though.
>
> Applying this to Intel upcoming region aware memory bandwidth
> that supports 255 steps and h/w min/max limits.
> We would have info files with "min = 1, max = 255" and a schemata
> file that looks like this to legacy apps:
>
> MB: 0=50;1=75
> #MB_HW: 0=128;1=191
> #MB_MIN: 0=128;1=191
> #MB_MAX: 0=128;1=191
>
> But a newer app that is aware of the extensions can write:
>
> # cat > schemata << 'EOF'
> MB_HW: 0=10
> MB_MIN: 0=10
> MB_MAX: 0=64
> EOF
>
> which then reads back as:
> MB: 0=4;1=75
> #MB_HW: 0=10;1=191
> #MB_MIN: 0=10;1=191
> #MB_MAX: 0=64;1=191
>
> with the legacy line updated with the rounded value of the MB_HW
> supplied by the user. 10/255 = 3.921% ... so call it "4".
>
This seems workable, as it introduces the new interface
while preserving backward compatibility.
One minor question is that, according to "Figure 6-5. MBA Optimal
Bandwidth Register" in the latest RDT specification, the maximum
value ranges from 1 to 511.
Additionally, this bandwidth field is located at bits 48 to 56 in
the MBA Optimal Bandwidth Register, and the range for
this segment could be 1 to 8191. I wonder whether the current
maximum value of 512 might be extended in the future. Perhaps
we could explore a method to query the maximum
upper limit from the ACPI table or register, or use CPUID to distinguish
between platforms rather than hardcoding it. Reinette also mentioned
this in another thread.
Thanks,
Chenyu
[1]
https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html
> The region aware h/w supports separate bandwidth controls for each
> region. We could hope (or perhaps update the spec to define) that
> region0 is always node-local DDR memory and keep the "MB" tag for
> that.
>
> Then use some other tag naming for other regions. Remote DDR,
> local CXL, remote CXL are the ones we think are next in the h/w
> memory sequence. But the "region" concept would allow for other
> options as other memory technologies come into use.
>
>>
>> Cheers
>> ---Dave
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-25 20:53 ` Reinette Chatre
2025-09-25 21:35 ` Luck, Tony
@ 2025-09-29 12:43 ` Dave Martin
2025-09-29 15:38 ` Reinette Chatre
1 sibling, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-09-29 12:43 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/25/25 5:46 AM, Dave Martin wrote:
> > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
> >> On 9/22/25 7:39 AM, Dave Martin wrote:
> >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> >>>> Hi Dave,
[...]
> >>>> Also please use upper case for acronym mba->MBA.
> >>>
> >>> Ack (the local custom in the MPAM code is to use "mba", but arguably,
> >>> the meaning is not quite the same -- I'll change it.)
> >>
> >> I am curious what the motivation is for the custom? Knowing this will help
> >> me to keep things consistent when the two worlds meet.
> >
> > I think this has just evolved over time. On the x86 side, MBA is a
> > specific architectural feature, but on the MPAM side the architecture
> > doesn't really have a name for the same thing. Memory bandwidth is a
> > concept, but a few different types of control are defined for it, with
> > different names.
> >
> > So, for the MPAM driver "mba" is more of a software concept than
> > something in a published spec: it's the glue that attaches to "MB"
> > resource as seen through resctrl.
> >
> > (This isn't official though; it's just the mental model that I have
> > formed.)
>
> I see. Thank you for the details. My mental model is simpler: write acronyms
> in upper case.
Generally, I agree, although I'm not sure whether that acronym belongs
in the MPAM-specific code.
For this patch, though, that's irrelevant. I've changed it to "MBA"
as requested.
[...]
> >> really sound as though the current interface works that great for MPAM. If I
> >> understand correctly this patch enables MPAM to use existing interface for
> >> its memory bandwidth allocations but doing so does not enable users to
> >> obtain benefit of hardware capabilities. For that users would want to use
> >> the new interface?
> >
> > In ideal world, probably, yes.
> >
> > Since not all use cases will care about full precision, the MB resource
> > (approximated for MPAM) should be fine for a lot of people, but I
> > expect that sooner or later somebody will want more exact control.
>
> ack.
OK.
[...]
> >> Considering the two statements:
> >> - "The available steps are no larger than this value."
> >> - "this value ... is not smaller than the apparent size of any individual rounding step"
> >>
> >> The "not larger" and "not smaller" sounds like all these words just end up saying that
> >> this is the step size?
> >
> > They are intended to be the same statement: A <= B versus
> > B >= A respectively.
>
> This is what I understood from the words ... and that made me think that it
> can be simplified to A = B ... but no need to digress ... onto the alternatives below ...
Right...
[...]
> > Instead, maybe we can just say something like:
> >
> > | The available steps are spaced at roughly equal intervals between the
> > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
> > | info/MB/bandwidth_gran gives the worst-case precision of these
> > | interval steps, in per cent.
> >
> > What do you think?
>
> I find "worst-case precision" a bit confusing, consider for example, what
> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
> the upper limit of these interval steps"? I believe this matches what you
> mentioned a couple of messages ago: "The available steps are no larger than this
> value."
Yes, that works. "Worst case" implies a value judgement that smaller
steps are "better" than large steps, since the goal is control.
But your wording, to the effect that this is the largest (apparent)
step size, conveys all the needed information.
> (and "per cent" -> "percent")
(Note: https://en.wiktionary.org/wiki/per_cent )
(Though either is acceptable, the fused word has a more informal feel
to it for me. Happy to change it -- though your rewording below gets
rid of it anyway. This word doesn't appear in resctrl.rst --
everything is "percentage" etc.)
>
> > If that's adequate, then the wording under the definition of
> > "bandwidth_gran" could be aligned with this.
>
> I think putting together a couple of your proposals and statements while making the
> text more accurate may work:
>
> "bandwidth_gran":
> The approximate granularity in which the memory bandwidth
> percentage is allocated. The allocated bandwidth percentage
> is rounded up to the next control step available on the
> hardware. The available hardware steps are no larger than
> this value.
That's better, thanks. I'm happy to pick this up and reword the text
in both places along these lines.
> I assume "available" is needed because, even though the steps are not larger
> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
> to 100% range?
Yes -- or, rather, the steps _look_ inconsistent because they are
rounded to exact percentages by the interface.
I don't think we expect the actual steps in the hardware to be
irregular.
[...]
> Reinette
Cheers
---Dave
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-25 22:18 ` Reinette Chatre
@ 2025-09-29 13:08 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-29 13:08 UTC (permalink / raw)
To: Reinette Chatre, Luck, Tony
Cc: linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86, linux-doc
Hi Reinette, Tony,
On Thu, Sep 25, 2025 at 03:18:51PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 9/25/25 2:35 PM, Luck, Tony wrote:
[...]
> > Director Technology (Intel® RDT) Architecture Specification"
> >
> > https://cdrdv2.intel.com/v1/dl/getContent/789566
> >
> > describes the upcoming region aware memory bandwidth allocation
> > controls as being a number from "1" to "Q" (enumerated in an ACPI
> > table). First implementation looks like Q == 255 which means a
> > granularity of 0.392%. The spec has headroom to allow Q == 511.
That does look like it would benefit from exposing the hardware field
without rounding (as for MPAM).
Is the relationship between this value and the expected memory system
throughput actually defined anywhere?
If the expected throughput is exactly proportional to this value, or a
reasonable approximation to this, then that is simple -- but I can't
see it actually stated.
When a spec suggests a need to divide by (2^N - 1), I do wonder whether
that is what they _really_ meant (and whether hardware will just do the
obvious cheap approximation in defiance of the spec).
> >
> > I don't expect users to need that granularity at the high bandwidth
> > end of the range, but I do expect them to care for highly throttled
> > background/batch jobs to make sure they can't affect performance of
> > the high priority jobs.
A case where it _might_ matter is where there is a non-trivial number
of jobs, and an attempt is made to share bandwidth among them.
Although it may not matter exactly how much bandwidth is given to each
job, the rounding errors may accumulate so that they add up to
significantly more than or less than 100% in total. This feels
undesirable.
Rounding off the value in the interface effectively makes it impossible
for portable software to avoid this problem...
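To put a number on that (the job count here is invented): splitting
bandwidth equally across several jobs with whole-percent rounding can
drift noticeably from the intended total:

```python
# Hypothetical: 7 jobs, each meant to get 1/7 of the bandwidth.
jobs = 7
exact_share = 100 / jobs         # ~14.2857% each
rounded_share = -(-100 // jobs)  # each share rounded up to a whole percent: 15
total = rounded_share * jobs     # 105: over-commits the resource by 5 points
print(exact_share, rounded_share, total)
```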
> > I'd hate to have to round all low bandwidth controls to 1% steps.
+1! (No pun intended.)
> This is the limitation of choosing to expose this feature as an MB resource
> and seems to be the same problem that Dave is facing. For finer granularity
> allocations I expect that we would need a new schema/resource backed by new
> properties as proposed by Dave in
> https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
> This will require updates to user space (that will anyway be needed if wedging
> another non-ABI input into MB).
>
> Reinette
Ack; while we could add decimal places to bandwidth_gran as reported to
userspace, we don't know that software isn't going to choke on that.
Plus, we could need to add precision to the control values too --
it's no good advertising 0.5% granularity when the MB schema only
accepts/reports integers.
Software that parses anything as (potentially) a real number might work
transparently, but we didn't warn users that they might need to do
that...
Cheers
---Dave
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-26 20:54 ` Reinette Chatre
@ 2025-09-29 13:40 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-29 13:40 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Fri, Sep 26, 2025 at 01:54:10PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> Just one correction ...
>
> On 9/22/25 8:04 AM, Dave Martin wrote:
[...]
> > For AMD SMBA, we might have:
> >
> > min: 1
> > max: 100
> > scale: 8
> > unit: 1GBps
> >
>
> Unfortunately not like this for AMD. Initial support for AMD MBA set max
> to a hardcoded 2048 [1] that was later [2] modified to learn max from hardware.
> Of course this broke resctrl as a generic interface and I hope we learned
> enough since to not repeat this mistake nor give up on MB and make its interface
> even worse by, for example, adding more architecture specific input ranges.
>
> Reinette
>
> [1] commit 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
> [2] commit 0976783bb123 ("x86/resctrl: Remove hard-coded memory bandwidth limit")
The "100" was just picked randomly in my example. Looking more
carefully at the spec, it may make sense to have:
min: 1
max: (1 << value of BW_LEN)
scale: 8
unit: 1GBps
(This max value corresponds to setting the "unlimited" bit in the
control MSR; the other bits of the bandwidth value are then ignored.
For this instance of the schema, programming the "max" value would be
expected to give the nearest approximation to unlimited bandwidth that
the hardware permits.)
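Under this encoding (my reading of the proposal; BW_LEN = 11 below is a
hypothetical field width, not a spec value), the physical bandwidth for a
schemata value s would follow p = (s / scale) * unit:

```python
def smba_bandwidth_gbps(s, scale=8):
    """Physical bandwidth for schemata value s, with unit = 1 GBps and
    scale = 8, i.e. steps of 1/8 GBps."""
    return s / scale

BW_LEN = 11              # hypothetical field width for this sketch
max_val = 1 << BW_LEN    # programming this value requests "unlimited"
print(smba_bandwidth_gbps(8))        # 1.0 GBps
print(smba_bandwidth_gbps(max_val))  # 256.0 GBps, the "unlimited" encoding
```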
While the memory system is under-utilised end-to-end, I would expect
throughput from a memory-bound job to scale linearly with the control
value, but the control level at which throughput starts to saturate
will depend on the pattern of load throughout the system.
This seems fundamentally different from percentage controls -- it looks
impossible to simulate proportional controls with absolute throughput
controls, or vice versa (?)
Cheers
---Dave
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-25 22:58 ` Luck, Tony
2025-09-29 9:19 ` Chen, Yu C
@ 2025-09-29 13:56 ` Dave Martin
2025-09-29 16:09 ` Reinette Chatre
2025-09-29 16:37 ` Luck, Tony
1 sibling, 2 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-29 13:56 UTC (permalink / raw)
To: Luck, Tony
Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Tony,
Thanks for taking at look at this -- comments below.
[...]
On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
[...]
> > Would something like the following work? A read from schemata might
> > produce something like this:
> >
> > MB: 0=50, 1=50
> > # MB_HW: 0=32, 1=32
> > # MB_MIN: 0=31, 1=31
> > # MB_MAX: 0=32, 1=32
[...]
> > I'd be interested in people's thoughts on it, though.
>
> Applying this to Intel upcoming region aware memory bandwidth
> that supports 255 steps and h/w min/max limits.
Following the MPAM example, would you also expect:
scale: 255
unit: 100pc
...?
> We would have info files with "min = 1, max = 255" and a schemata
> file that looks like this to legacy apps:
>
> MB: 0=50;1=75
> #MB_HW: 0=128;1=191
> #MB_MIN: 0=128;1=191
> #MB_MAX: 0=128;1=191
>
> But a newer app that is aware of the extensions can write:
>
> # cat > schemata << 'EOF'
> MB_HW: 0=10
> MB_MIN: 0=10
> MB_MAX: 0=64
> EOF
>
> which then reads back as:
> MB: 0=4;1=75
> #MB_HW: 0=10;1=191
> #MB_MIN: 0=10;1=191
> #MB_MAX: 0=64;1=191
>
> with the legacy line updated with the rounded value of the MB_HW
> supplied by the user. 10/255 = 3.921% ... so call it "4".
I'm suggesting that this always be rounded up, so that you have a
guarantee that the steps are no smaller than the reported value.
(In this case, round-up and round-to-nearest give the same answer
anyway, though!)
>
> The region aware h/w supports separate bandwidth controls for each
> region. We could hope (or perhaps update the spec to define) that
> region0 is always node-local DDR memory and keep the "MB" tag for
> that.
Do you have concerns about existing software choking on the #-prefixed
lines?
> Then use some other tag naming for other regions. Remote DDR,
> local CXL, remote CXL are the ones we think are next in the h/w
> memory sequence. But the "region" concept would allow for other
> options as other memory technologies come into use.
Would it be reasonable just to have a set of these schema instances, per
region, so:
MB_HW: ... // implicitly region 0
MB_HW_1: ...
MB_HW_2: ...
etc.
Or, did you have something else in mind?
My thinking is that we avoid adding complexity in the schemata file if
we treat mapping these schema instances onto the hardware topology as
an orthogonal problem. So long as we have unique names in the schemata
file, we can describe elsewhere what they relate to in the hardware.
Cheers
---Dave
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 9:19 ` Chen, Yu C
@ 2025-09-29 14:13 ` Dave Martin
2025-09-29 16:23 ` Luck, Tony
2025-09-30 4:43 ` Chen, Yu C
0 siblings, 2 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-29 14:13 UTC (permalink / raw)
To: Chen, Yu C
Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, linux-doc
Hi there,
On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote:
> On 9/26/2025 6:58 AM, Luck, Tony wrote:
[...]
> > Applying this to Intel upcoming region aware memory bandwidth
> > that supports 255 steps and h/w min/max limits.
> > We would have info files with "min = 1, max = 255" and a schemata
> > file that looks like this to legacy apps:
> >
> > MB: 0=50;1=75
> > #MB_HW: 0=128;1=191
> > #MB_MIN: 0=128;1=191
> > #MB_MAX: 0=128;1=191
> >
> > But a newer app that is aware of the extensions can write:
> >
> > # cat > schemata << 'EOF'
> > MB_HW: 0=10
> > MB_MIN: 0=10
> > MB_MAX: 0=64
> > EOF
> >
> > which then reads back as:
> > MB: 0=4;1=75
> > #MB_HW: 0=10;1=191
> > #MB_MIN: 0=10;1=191
> > #MB_MAX: 0=64;1=191
> >
> > with the legacy line updated with the rounded value of the MB_HW
> > supplied by the user. 10/255 = 3.921% ... so call it "4".
> >
>
> This seems to be applicable as it introduces the new interface
> while preserving forward compatibility.
>
> One minor question is that, according to "Figure 6-5. MBA Optimal
> Bandwidth Register" in the latest RDT specification, the maximum
> value ranges from 1 to 511.
> Additionally, this bandwidth field is located at bits 48 to 56 in
> the MBA Optimal Bandwidth Register, and the range for
> this segment could be 1 to 8191. Just wonder if it would be
> possible that the current maximum value of 512 may be extended
> in the future? Perhaps we could explore a method to query the maximum upper
> limit from the ACPI table or register, or use CPUID to distinguish between
> platforms rather than hardcoding it. Reinette also mentioned this in another
> thread.
>
> Thanks,
> Chenyu
>
>
> [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html
I can't comment on the direction of travel in the RDT architecture.
I guess it would be up to the arch code whether to trust ACPI if it
says that the maximum value of this field is > 511. (> 65535 would be
impossible though, since the fields would start to overlap each
other...)
Would anything break in the interface proposed here, if the maximum
value is larger than 511? (I'm hoping not. For MPAM, the bandwidth
controls can have up to 16 bits and the size can be probed through MMIO
registers.)
I don't think we've seen MPAM hardware that comes close to 16 bits for
now, though.
Cheers
---Dave
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 12:43 ` Dave Martin
@ 2025-09-29 15:38 ` Reinette Chatre
2025-09-29 16:10 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-09-29 15:38 UTC (permalink / raw)
To: Dave Martin
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 9/29/25 5:43 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 9/25/25 5:46 AM, Dave Martin wrote:
>>> On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote:
>>>> On 9/22/25 7:39 AM, Dave Martin wrote:
>>>>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>>>>>> Hi Dave,
>
> [...]
>
>>>>>> Also please use upper case for acronym mba->MBA.
>>>>>
>>>>> Ack (the local custom in the MPAM code is to use "mba", but arguably,
>>>>> the meaning is not quite the same -- I'll change it.)
>>>>
>>>> I am curious what the motivation is for the custom? Knowing this will help
>>>> me to keep things consistent when the two worlds meet.
>>>
>>> I think this has just evolved over time. On the x86 side, MBA is a
>>> specific architectural feature, but on the MPAM side the architecture
>>> doesn't really have a name for the same thing. Memory bandwidth is a
>>> concept, but a few different types of control are defined for it, with
>>> different names.
>>>
>>> So, for the MPAM driver "mba" is more of a software concept than
>>> something in a published spec: it's the glue that attaches to "MB"
>>> resource as seen through resctrl.
>>>
>>> (This isn't official though; it's just the mental model that I have
>>> formed.)
>>
>> I see. Thank you for the details. My mental model is simpler: write acronyms
>> in upper case.
>
> Generally, I agree, although I'm not sure whether that acronym belongs
> in the MPAM-specific code.
>
> For this patch, though, that's irrelevant. I've changed it to "MBA"
> as requested.
>
Thank you.
...
>>>> Considering the two statements:
>>>> - "The available steps are no larger than this value."
>>>> - "this value ... is not smaller than the apparent size of any individual rounding step"
>>>>
>>>> The "not larger" and "not smaller" sounds like all these words just end up saying that
>>>> this is the step size?
>>>
>>> They are intended to be the same statement: A <= B versus
>>> B >= A respectively.
>>
>> This is what I understood from the words ... and that made me think that it
>> can be simplified to A = B ... but no need to digress ... onto the alternatives below ...
>
> Right...
>
> [...]
>
>>> Instead, maybe we can just say something like:
>>>
>>> | The available steps are spaced at roughly equal intervals between the
>>> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading
>>> | info/MB/bandwidth_gran gives the worst-case precision of these
>>> | interval steps, in per cent.
>>>
>>> What do you think?
>>
>> I find "worst-case precision" a bit confusing, consider for example, what
>> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
>> the upper limit of these interval steps"? I believe this matches what you
>> mentioned a couple of messages ago: "The available steps are no larger than this
>> value."
>
> Yes, that works. "Worst case" implies a value judgement that smaller
> steps are "better" than large steps, since the goal is control.
>
> But your wording, to the effect that this is the largest (apparent)
> step size, conveys all the needed information.
Thank you for considering it. My preference is for stating things succinctly
and not leaving too much open to interpretation.
>
>> (and "per cent" -> "percent")
>
> ( Note: https://en.wiktionary.org/wiki/per_cent )
Yes, in particular I note the "chiefly Commonwealth". I respect the differences
in the English language and was easily convinced in earlier MPAM work to
accept different spelling. I now regret doing so because after merge we now have a
supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant
to encounter when reading the code and also complicates new work.
> (Though either is acceptable, the fused word has a more informal feel
> to it for me. Happy to change it -- though your rewording below gets
> rid of it anyway. This word doesn't appear in resctrl.rst --
> everything is "percentage" etc.)
>
>>
>>> If that's adequate, then the wording under the definition of
>>> "bandwidth_gran" could be aligned with this.
>>
>> I think putting together a couple of your proposals and statements while making the
>> text more accurate may work:
>>
>> "bandwidth_gran":
>> The approximate granularity in which the memory bandwidth
>> percentage is allocated. The allocated bandwidth percentage
>> is rounded up to the next control step available on the
>> hardware. The available hardware steps are no larger than
>> this value.
>
> That's better, thanks. I'm happy to pick this up and reword the text
> in both places along these lines.
Thank you very much. Please do feel free to rework.
>
>> I assume "available" is needed because, even though the steps are not larger
>> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
>> to 100% range?
>
> Yes -- or, rather, the steps _look_ inconsistent because they are
> rounded to exact percentages by the interface.
>
> I don't think we expect the actual steps in the hardware to be
> irregular.
>
Thank you for clarifying.
Reinette
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 13:56 ` Dave Martin
@ 2025-09-29 16:09 ` Reinette Chatre
2025-09-30 15:40 ` Dave Martin
2025-09-29 16:37 ` Luck, Tony
1 sibling, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-09-29 16:09 UTC (permalink / raw)
To: Dave Martin, Luck, Tony
Cc: linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
x86, linux-doc
Hi Dave,
On 9/29/25 6:56 AM, Dave Martin wrote:
> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
>> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
...
>> The region aware h/w supports separate bandwidth controls for each
>> region. We could hope (or perhaps update the spec to define) that
>> region0 is always node-local DDR memory and keep the "MB" tag for
>> that.
>
> Do you have concerns about existing software choking on the #-prefixed
> lines?
I am trying to understand the purpose of the #-prefix. I see two motivations
for the #-prefix, with the primary point being that multiple schemata apply
to the same resource.
1) Commented schema are "inactive"
This is unclear to me. In the MB example the commented lines show the
finer grained controls. Since the original MB resource is an approximation
and the hardware must already be configured to support it, would the #-prefixed
lines not show the actual "active" configuration?
2) Commented schema are "conflicting"
The original proposal mentioned "write them back instead of (or in addition to)
the conflicting entries". I do not know how resctrl will be able to
handle a user requesting a change to both "MB" and "MB_HW". This seems like
something that should fail?
On a high level it is not clear to me why the # prefix is needed. As I
understand it, the schemata names will always be unique and the new
features made backward compatible with existing schemata names. That is,
existing MB, L3, etc.
will also have the new info files that describe their values/ranges.
I expect that user space will ignore schema that it is not familiar
with so the # prefix seems unnecessary?
I believe the motivation is to express a relationship between different
schema (you mentioned "shadow" initially). I think this relationship can
be expressed clearly by using a namespace prefix (like "MB_" in the examples).
This may help more when there are multiple schemata with this format where a #-prefix
does not make obvious which resource is shadowed.
>> Then use some other tag naming for other regions. Remote DDR,
>> local CXL, remote CXL are the ones we think are next in the h/w
>> memory sequence. But the "region" concept would allow for other
>> options as other memory technologies come into use.
>
> Would it be reasonable just to have a set of these schema instances, per
> region, so:
>
> MB_HW: ... // implicitly region 0
> MB_HW_1: ...
> MB_HW_2: ...
>
> etc.
>
> Or, did you have something else in mind?
>
> My thinking is that we avoid adding complexity in the schemata file if
> we treat mapping these schema instances onto the hardware topology as
> an orthogonal problem. So long as we have unique names in the schemata
> file, we can describe elsewhere what they relate to in the hardware.
Agreed ... and "elsewhere" is expected to be unique depending on the resource.
Reinette
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 15:38 ` Reinette Chatre
@ 2025-09-29 16:10 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-29 16:10 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Mon, Sep 29, 2025 at 08:38:13AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/29/25 5:43 AM, Dave Martin wrote:
[...]
> > Generally, I agree, although I'm not sure whether that acronym belongs
> > in the MPAM-specific code.
> >
> > For this patch, though, that's irrelevant. I've changed it to "MBA"
> > as requested.
> >
>
> Thank you.
[...]
> >> I find "worst-case precision" a bit confusing, consider for example, what
> >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives
> >> the upper limit of these interval steps"? I believe this matches what you
> >> mentioned a couple of messages ago: "The available steps are no larger than this
> >> value."
> >
> > Yes, that works. "Worst case" implies a value judgement that smaller
> > steps are "better" than large steps, since the goal is control.
> >
> > But your wording, to the effect that this is the largest (apparent)
> > step size, conveys all the needed information.
>
> Thank you for considering it. My preference is for stating things succinctly
> and not leaving too much open to interpretation.
I find that it's not always easy to work out what information is
essential without the discussion... so I hope that didn't feel like a
waste of time!
> >> (and "per cent" -> "percent")
> >
> > ( Note: https://en.wiktionary.org/wiki/per_cent )
>
> Yes, in particular I note the "chiefly Commonwealth". I respect the differences
> in the English language and was easily convinced in earlier MPAM work to
> accept different spelling. I now regret doing so because after merge we now have a
> supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant
> to encounter when reading the code and also complicates new work.
>
> > (Though either is acceptable, the fused word has a more informal feel
> > to it for me. Happy to change it -- though your rewording below gets
> > rid of it anyway. This word doesn't appear in resctrl.rst --
> > everything is "percentage" etc.)
Sure, it's best not to have mixed-up conventions in the same document.
(With this one, I wasn't aware that there were regional differences at
all, so that was news to me...)
[...]
> >> I think putting together a couple of your proposals and statements while making the
> >> text more accurate may work:
> >>
> >> "bandwidth_gran":
> >> The approximate granularity in which the memory bandwidth
> >> percentage is allocated. The allocated bandwidth percentage
> >> is rounded up to the next control step available on the
> >> hardware. The available hardware steps are no larger than
> >> this value.
> >
> > That's better, thanks. I'm happy to pick this up and reword the text
> > in both places along these lines.
>
> Thank you very much. Please do feel free to rework.
>
> >
> >> I assume "available" is needed because, even though the steps are not larger
> >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth"
> >> to 100% range?
> >
> > Yes -- or, rather, the steps _look_ inconsistent because they are
> > rounded to exact percentages by the interface.
> >
> > I don't think we expect the actual steps in the hardware to be
> > irregular.
> >
> Thank you for clarifying.
>
> Reinette
OK.
I'll tidy up the loose ends and repost once I've had a chance to
re-test.
Thanks for the review.
Cheers
---Dave
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 14:13 ` Dave Martin
@ 2025-09-29 16:23 ` Luck, Tony
2025-09-30 11:02 ` Chen, Yu C
2025-09-30 4:43 ` Chen, Yu C
1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2025-09-29 16:23 UTC (permalink / raw)
To: Dave Martin, Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86@kernel.org,
linux-doc@vger.kernel.org
> > This seems to be applicable as it introduces the new interface
> > while preserving forward compatibility.
> >
> > One minor question is that, according to "Figure 6-5. MBA Optimal
> > Bandwidth Register" in the latest RDT specification, the maximum
> > value ranges from 1 to 511.
> > Additionally, this bandwidth field is located at bits 48 to 56 in
> > the MBA Optimal Bandwidth Register, and the range for
> > this segment could be 1 to 8191. Just wonder if it would be
48..56 is still 9 bits, so max value is 511.
> > possible that the current maximum value of 512 may be extended
> > in the future? Perhaps we could explore a method to query the maximum upper
> > limit from the ACPI table or register, or use CPUID to distinguish between
> > platforms rather than hardcoding it. Reinette also mentioned this in another
> > thread.
I think 511 was chosen as "bigger than we expect to ever need" and 9-bits
allocated in the registers based on that.
Initial implementation may use 255 as the maximum - though I'm pushing on
that a bit, as the throttle graph at the early stage is fairly linear from
"1" up to some value < 255 where bandwidth hits its maximum, then flat up
to 255.
If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI
table should be the value where peak bandwidth is hit (though this is complicated
because workloads with different mixes of read/write access have different
throttle graphs).
> >
> > Thanks,
> > Chenyu
> >
> >
> > [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html
>
> I can't comment on the direction of travel in the RDT architecture.
>
> I guess it would be up to the arch code whether to trust ACPI if it
> says that the maximum value of this field is > 511. (> 65535 would be
> impossible though, since the fields would start to overlap each
> other...)
resctrl should do some sanity checks on values it sees in the ACPI
tables. Linux has:
#define FW_BUG "[Firmware Bug]: "
#define FW_WARN "[Firmware Warn]: "
#define FW_INFO "[Firmware Info]: "
for good historical reasons.
>
> Would anything break in the interface proposed here, if the maximum
> value is larger than 511? (I'm hoping not. For MPAM, the bandwidth
> controls can have up to 16 bits and the size can be probed through MMIO
> registers.)
>
> I don't think we've seen MPAM hardware that comes close to 16 bits for
> now, though.
While kernel code is sometimes space-conserving and uses u8/u16 types
for values that fit in some limited range, I'd expect user applications that
read the "info" files and program the "schemata" files to not care.
Python integers have arbitrary precision, so would be just fine with:
max 340282366920938463463374607431768211455
:-)
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 13:56 ` Dave Martin
2025-09-29 16:09 ` Reinette Chatre
@ 2025-09-29 16:37 ` Luck, Tony
2025-09-30 16:02 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2025-09-29 16:37 UTC (permalink / raw)
To: Dave Martin
Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote:
> Hi Tony,
>
> Thanks for taking at look at this -- comments below.
>
> [...]
>
> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
>
> [...]
>
> > > Would something like the following work? A read from schemata might
> > > produce something like this:
> > >
> > > MB: 0=50, 1=50
> > > # MB_HW: 0=32, 1=32
> > > # MB_MIN: 0=31, 1=31
> > > # MB_MAX: 0=32, 1=32
>
> [...]
>
> > > I'd be interested in people's thoughts on it, though.
> >
> > Applying this to Intel upcoming region aware memory bandwidth
> > that supports 255 steps and h/w min/max limits.
>
> Following the MPAM example, would you also expect:
>
> scale: 255
> unit: 100pc
>
> ...?
Yes. 255 (or whatever "Q" value is provided in the ACPI table)
corresponds to no throttling, so 100% bandwidth.
>
> > We would have info files with "min = 1, max = 255" and a schemata
> > file that looks like this to legacy apps:
> >
> > MB: 0=50;1=75
> > #MB_HW: 0=128;1=191
> > #MB_MIN: 0=128;1=191
> > #MB_MAX: 0=128;1=191
> >
> > But a newer app that is aware of the extensions can write:
> >
> > # cat > schemata << 'EOF'
> > MB_HW: 0=10
> > MB_MIN: 0=10
> > MB_MAX: 0=64
> > EOF
> >
> > which then reads back as:
> > MB: 0=4;1=75
> > #MB_HW: 0=10;1=191
> > #MB_MIN: 0=10;1=191
> > #MB_MAX: 0=64;1=191
> >
> > with the legacy line updated with the rounded value of the MB_HW
> > supplied by the user. 10/255 = 3.921% ... so call it "4".
>
> I'm suggesting that this always be rounded up, so that you have a
> guarantee that the steps are no smaller than the reported value.
Round up, rather than round-to-nearest, makes sense. Though perhaps
only cosmetic as I would be surprised if anyone has a mix of tools
looking at the legacy schemata lines while programming using the
direct h/w controls.
>
> (In this case, round-up and round-to-nearest give the same answer
> anyway, though!)
>
> >
> > The region aware h/w supports separate bandwidth controls for each
> > region. We could hope (or perhaps update the spec to define) that
> > region0 is always node-local DDR memory and keep the "MB" tag for
> > that.
>
> Do you have concerns about existing software choking on the #-prefixed
> lines?
Do they even need a # prefix? We already mix lines for multiple
resources in the schemata file with a separate prefix for each resource.
The schemata file also allows writes to just update one resource (or
one domain in a single resource). The schemata file started with just
"L3". Then we added "L2", "MB", and "SMBA" with no concern that the
initial "L3" manipulating tools would be confused.
> > Then use some other tag naming for other regions. Remote DDR,
> > local CXL, remote CXL are the ones we think are next in the h/w
> > memory sequence. But the "region" concept would allow for other
> > options as other memory technologies come into use.
>
> Would it be reasonable just to have a set of these schema instances, per
> region, so:
>
> MB_HW: ... // implicitly region 0
> MB_HW_1: ...
> MB_HW_2: ...
Chen Yu is currently looking at putting the word "TIER" into the
name, since there's some precedent for describing memory in "tiers".
Whatever naming scheme is used, the important part is how users will
find out what each schemata line actually means/controls.
> etc.
>
> Or, did you have something else in mind?
>
> My thinking is that we avoid adding complexity in the schemata file if
> we treat mapping these schema instances onto the hardware topology as
> an orthogonal problem. So long as we have unique names in the schemata
> file, we can describe elsewhere what they relate to in the hardware.
Yes, exactly this.
>
> Cheers
> ---Dave
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 14:13 ` Dave Martin
2025-09-29 16:23 ` Luck, Tony
@ 2025-09-30 4:43 ` Chen, Yu C
2025-09-30 15:55 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Chen, Yu C @ 2025-09-30 4:43 UTC (permalink / raw)
To: Dave Martin, Luck, Tony
Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
On 9/29/2025 10:13 PM, Dave Martin wrote:
> Hi there,
>
> On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote:
>> On 9/26/2025 6:58 AM, Luck, Tony wrote:
>
> [...]
>
>>> Applying this to Intel upcoming region aware memory bandwidth
>>> that supports 255 steps and h/w min/max limits.
>>> We would have info files with "min = 1, max = 255" and a schemata
>>> file that looks like this to legacy apps:
>>>
>>> MB: 0=50;1=75
>>> #MB_HW: 0=128;1=191
>>> #MB_MIN: 0=128;1=191
>>> #MB_MAX: 0=128;1=191
>>>
>>> But a newer app that is aware of the extensions can write:
>>>
>>> # cat > schemata << 'EOF'
>>> MB_HW: 0=10
>>> MB_MIN: 0=10
>>> MB_MAX: 0=64
>>> EOF
>>>
>>> which then reads back as:
>>> MB: 0=4;1=75
>>> #MB_HW: 0=10;1=191
>>> #MB_MIN: 0=10;1=191
>>> #MB_MAX: 0=64;1=191
>>>
>>> with the legacy line updated with the rounded value of the MB_HW
>>> supplied by the user. 10/255 = 3.921% ... so call it "4".
>>>
>>
>> This seems to be applicable as it introduces the new interface
>> while preserving forward compatibility.
>>
>> One minor question is that, according to "Figure 6-5. MBA Optimal
>> Bandwidth Register" in the latest RDT specification, the maximum
>> value ranges from 1 to 511.
>> Additionally, this bandwidth field is located at bits 48 to 56 in
>> the MBA Optimal Bandwidth Register, and the range for
>> this segment could be 1 to 8191. Just wonder if it would be
>> possible that the current maximum value of 512 may be extended
>> in the future? Perhaps we could explore a method to query the maximum upper
>> limit from the ACPI table or register, or use CPUID to distinguish between
>> platforms rather than hardcoding it. Reinette also mentioned this in another
>> thread.
>>
>> Thanks,
>> Chenyu
>>
>>
>> [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html
>
> I can't comment on the direction of travel in the RDT architecture.
>
> I guess it would be up to the arch code whether to trust ACPI if it
> says that the maximum value of this field is > 511. (> 65535 would be
> impossible though, since the fields would start to overlap each
> other...)
>
> Would anything break in the interface proposed here, if the maximum
> value is larger than 511? (I'm hoping not. For MPAM, the bandwidth
> controls can have up to 16 bits and the size can be probed through MMIO
> registers.)
>
I overlooked this bit width. It should not exceed 511 according to the
RDT spec. Previously, I was just wondering how to calculate the legacy
MB percentage in Tony's example, to keep things consistent: if the user
provides a value of 10, what is the denominator? Is it 255, 511, or
something queried from ACPI?
MB: 0=4;1=75 <--- 10/255
#MB_HW: 0=10;1=191
#MB_MIN: 0=10;1=191
#MB_MAX: 0=64;1=191
or
MB: 0=1;1=75 <--- 10/511
#MB_HW: 0=10;1=191
#MB_MIN: 0=10;1=191
#MB_MAX: 0=64;1=191
thanks,
Chenyu
> I don't think we've seen MPAM hardware that comes close to 16 bits for
> now, though.
>
> Cheers
> ---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 16:23 ` Luck, Tony
@ 2025-09-30 11:02 ` Chen, Yu C
2025-09-30 16:08 ` Luck, Tony
0 siblings, 1 reply; 52+ messages in thread
From: Chen, Yu C @ 2025-09-30 11:02 UTC (permalink / raw)
To: Luck, Tony
Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86@kernel.org,
linux-doc@vger.kernel.org, Dave Martin
On 9/30/2025 12:23 AM, Luck, Tony wrote:
>>> This seems to be applicable as it introduces the new interface
>>> while preserving forward compatibility.
>>>
>>> One minor question is that, according to "Figure 6-5. MBA Optimal
>>> Bandwidth Register" in the latest RDT specification, the maximum
>>> value ranges from 1 to 511.
>>> Additionally, this bandwidth field is located at bits 48 to 56 in
>>> the MBA Optimal Bandwidth Register, and the range for
>>> this segment could be 1 to 8191. Just wonder if it would be
>
> 48..56 is still 9 bits, so max value is 511.
>
Ah I see, I overlooked this.
>>> possible that the current maximum value of 512 may be extended
>>> in the future? Perhaps we could explore a method to query the maximum upper
>>> limit from the ACPI table or register, or use CPUID to distinguish between
>>> platforms rather than hardcoding it. Reinette also mentioned this in another
>>> thread.
>
> I think 511 was chosen as "bigger than we expect to ever need" and 9-bits
> allocated in the registers based on that.
>
OK, got it.
> Initial implementation may use 255 as the maximum - though I'm pushing on
> that a bit as the throttle graph at the early stage is fairly linear from "1" to some
> value < 255,
> when bandwidth hits maximum, then flat up to 255.
> If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI
> table should be the value where peak bandwidth is hit
I see. If I understand correctly, the BIOS needs to pre-train the system
to find this Q. However, if the BIOS cannot provide this Q, would it be
feasible for the user to provide it? For example, the user could saturate
the memory bandwidth, gradually increase MB_MAX, and finally find the
Q_max where the memory bandwidth no longer increases. The user could then
adjust the max field in the info file.
> (though this is complicated
> because workloads with different mixes of read/write access have different
> throttle graphs).
>
Does this mean read and write operations have different Q values to
saturate the memory bandwidth? For example, if the workload is all reads,
there is a Q_r; if the workload is all writes, there is another Q_w. In
that case, maybe we could choose the maximum of the two, max(Q_r, Q_w).
thanks,
Chenyu
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 16:09 ` Reinette Chatre
@ 2025-09-30 15:40 ` Dave Martin
2025-10-10 16:48 ` Reinette Chatre
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-09-30 15:40 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi,
On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/29/25 6:56 AM, Dave Martin wrote:
> > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> >> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
>
> ...
>
> >> The region aware h/w supports separate bandwidth controls for each
> >> region. We could hope (or perhaps update the spec to define) that
> >> region0 is always node-local DDR memory and keep the "MB" tag for
> >> that.
> >
> > Do you have concerns about existing software choking on the #-prefixed
> > lines?
>
> I am trying to understand the purpose of the #-prefix. I see two motivations
> for the #-prefix with the primary point that multiple schema apply to the same
> resource.
>
> 1) Commented schema are "inactive"
> This is unclear to me. In the MB example the commented lines show the
> finer grained controls. Since the original MB resource is an approximation
> and the hardware must already be configured to support it, would the #-prefixed
> lines not show the actual "active" configuration?
They would show the active configuration (possibly more precisely than
"MB" does).
If not, it's not clear how userspace that is trying to use MB_HW (say)
could read out the current configuration.
The # is intended to make resctrl ignore the lines when the file
is written by userspace. This is done so that userspace has to
actually change those lines in order for them to take effect when
writing. Old userspace can just pass them through without modification,
without anything unexpected happening.
The reason why I think that this convention may be needed is that we
never told (old) userspace what it was supposed to do with schemata
entries that it does not recognise.
> 2) Commented schema are "conflicting"
> The original proposal mentioned "write them back instead of (or in addition to)
> the conflicting entries". I do not know how resctrl will be able to
> handle a user requesting a change to both "MB" and "MB_HW". This seems like
> something that should fail?
If userspace is asking for two incompatible things at the same time, we
can either pick one of them and ignore the rest, or do nothing, or fail
explicitly.
If we think that it doesn't really matter what happens, then resctrl
could just dumbly process the entries in the order given. If the
result is not what userspace wanted, that's not our problem.
(Today, nothing prevents userspace writing multiple "MB" lines at the
same time: resctrl will process them all, but only the final one will
have a lasting effect. So, the fact that a resctrl write can contain
mutually incompatible requests does not seem to be new.)
> On a high level it is not clear to me why the # prefix is needed. As I understand the
> schemata names will always be unique and the new features made backward
> compatible to existing schemata names. That is, existing MB, L3, etc.
> will also have the new info files that describe their values/ranges.
Regarding backwards compatibility for the existing controls:
This proposal is only about numeric controls. L3 wouldn't change, but
we could still add info/ metadata for bitmap control at the same time
as adding it for numeric controls.
MB may be hard to describe in a useful way, though -- at least in the
MPAM case, where the number of steps does not divide into 100, and the
AMD cases where the meaning of the MB control values is different.
MB and MB_HW are not interchangeable. To obtain predictable results
from MB, userspace would need to know precisely how the kernel is going
to round the value. This feels like an implementation detail that
doesn't belong in the ABI.
I suppose we could also add a "granularity" entry in info/, but we have
the existing "bandwidth_gran" file for MB. For any new schema, I don't
think we need to state the granularity: the other parameters can always
be adjusted so that the granularity is exactly 1.
Regarding the "#" (see below):
> I expect that user space will ignore schema that it is not familiar
> with so the # prefix seems unnecessary?
>
> I believe the motivation is to express a relationship between different
> schema (you mentioned "shadow" initially). I think this relationship can
> be expressed clearly by using a namespace prefix (like "MB_" in the examples).
> This may help more when there are multiple schemata with this format where a #-prefix
> does not make obvious which resource is shadowed.
An illustration would probably help, here.
Say resctrl has schemata MB, MB_HW, MB_MIN and MB_MAX, all of
which control (aspects of) the same underlying hardware resource.
Reading the schemata file might yield:
MB: 0=29
MB_HW: 0=2
MB_MIN: 0=1
MB_MAX: 0=2
(I assume for this toy example that MB_{HW,MIN,MAX} ranges from
0 to 7.)
Now, suppose (current) userspace wants to change the allocated
bandwidth. It only understands the "MB" line, but it is reasonable to
expect that writing the other lines back without modification will do
nothing. (A human user might read the file and tweak just the entry of
interest in an editor, run it through awk, etc.)
So, the user writes back:
MB: 0=43
MB_HW: 0=2
MB_MIN: 0=1
MB_MAX: 0=2
If resctrl just processes the entries in order, it will temporarily
change the bandwidth allocation due to the "MB" row, but will then
immediately change it back again due to the other rows.
Reading schemata again now gives:
MB: 0=29
MB_HW: 0=2
MB_MIN: 0=1
MB_MAX: 0=2
We might be able to solve some problems of this sort by reordering the
entries, but I suspect that some software may give up as soon as it
sees an unfamiliar entry -- so it may be better to keep the classic
entries (like "MB") at the start.
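As a toy illustration of the in-order processing problem (illustrative
Python, not kernel code; the percent-to-raw conversion and the 0..7 raw
range are invented for this example):

```python
def apply_in_order(lines, q=7):
    """Toy model: 'MB' (a percentage) and the raw-value entries all
    program the same underlying hardware control, in the order given.
    q is the invented raw-value maximum (0..7) for this example."""
    hw = None
    for line in lines:
        name, _, value = line.partition(": 0=")
        if name == "MB":
            hw = round(int(value) * q / 100)  # percent -> raw (toy conversion)
        elif name in ("MB_HW", "MB_MIN", "MB_MAX"):
            hw = int(value)                   # raw value programmed directly
    return hw
```

Here the "MB: 0=43" row maps to raw value 3, but the trailing rows
immediately overwrite it back to 2 -- which reads back as 29%, matching
the example above.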
Anyway, going back to the "#" convention:
If the initial read of schemata has the new entries "pre-commented",
then userspace wouldn't need to know about the new entries. It could
just tweak the MB entry (which it knows about), and write the file back:
MB: 0=43
# MB_HW: 0=2
# MB_MIN: 0=1
# MB_MAX: 0=2
then resctrl knows to ignore the hashed lines, and so reading the file
back gives:
MB: 0=43
# MB_HW: 0=3
# MB_MIN: 0=2
# MB_MAX: 0=3
(For hardware-specific reasons, the MPAM driver currently internally
programs the MIN bound to be a bit less than the MAX bound, when
userspace writes an "MB" entry into schemata. The key thing is that
writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
resctrl level, I don't think that we necessarily need to make promises
about what they can change _to_. The exact effect of MIN and MAX
bounds is likely to be hardware-dependent anyway.)
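To make the proposed write-side rule concrete, here is a toy sketch
(illustrative Python, not kernel code; the function name is invented):
"#"-prefixed lines are skipped on write, and the remaining entries are
applied in order.

```python
def parse_schemata_write(text):
    """Toy model of the proposed schemata write semantics:
    '#'-prefixed (pre-commented) lines are ignored, and the
    remaining entries are applied in order, later lines winning."""
    applied = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # pre-commented entry: not applied
        name, _, value = line.partition(":")
        applied[name.strip()] = value.strip()
    return applied
```

So a legacy write that pastes back the commented lines unchanged only
applies the "MB" entry.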
Regarding new userspace:
Going forward, we can explicitly document that there should be no
conflicting or "passenger" entries in a schemata write: don't include
an entry for something that you don't explicitly want to set, and if
multiple entries affect the same resource, we don't promise what
happens.
(But sadly, we can't impose that rule on existing software after the
fact.)
One final note: I have not provided any way to indicate that all those
entries control the same hardware resource. The common "MB" prefix is
intended as a clue, but ultimately, userspace needs to know what an
entry controls before tweaking it.
We could try to describe the relationships explicitly, but I'm not sure
that it is useful...
> >> Then use some other tag naming for other regions. Remote DDR,
> >> local CXL, remote CXL are the ones we think are next in the h/w
> >> memory sequence. But the "region" concept would allow for other
> >> options as other memory technologies come into use.
> >
> > Would it be reasonable just to have a set of these schema instances, per
> > region, so:
> >
> > MB_HW: ... // implicitly region 0
> > MB_HW_1: ...
> > MB_HW_2: ...
> >
> > etc.
> >
> > Or, did you have something else in mind?
> >
> > My thinking is that we avoid adding complexity in the schemata file if
> > we treat mapping these schema instances onto the hardware topology as
> > an orthogonal problem. So long as we have unique names in the schemata
> > file, we can describe elsewhere what they relate to in the hardware.
>
> Agreed ... and "elsewhere" is expected to be unique depending on the resource.
>
> Reinette
Yes
Cheers
---Dave
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-30 4:43 ` Chen, Yu C
@ 2025-09-30 15:55 ` Dave Martin
2025-10-01 12:13 ` Chen, Yu C
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-09-30 15:55 UTC (permalink / raw)
To: Chen, Yu C
Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, linux-doc
Hi,
On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote:
> On 9/29/2025 10:13 PM, Dave Martin wrote:
[...]
> > I guess it would be up to the arch code whether to trust ACPI if it
> > says that the maximum value of this field is > 511. (> 65535 would be
> > impossible though, since the fields would start to overlap each
> > other...)
> >
> > Would anything break in the interface proposed here, if the maximum
> > value is larger than 511? (I'm hoping not. For MPAM, the bandwidth
> > controls can have up to 16 bits and the size can be probed through MMIO
> > registers.)
> >
>
> I overlooked this bit width. It should not exceed 511 according to the
> RDT spec. Previously, I was just wondering how to calculate the legacy
> MB percentage in Tony's example. If we want to keep consistency - if
> the user provides a value of 10, what is the denominator: Is it 255,
> 511, or something queried from ACPI.
>
> MB: 0=4;1=75 <--- 10/255
> #MB_HW: 0=10;1=191
> #MB_MIN: 0=10;1=191
> #MB_MAX: 0=64;1=191
>
> or
>
> MB: 0=1;1=75 <--- 10/511
> #MB_HW: 0=10;1=191
> #MB_MIN: 0=10;1=191
> #MB_MAX: 0=64;1=191
>
> thanks,
> Chenyu
The denominator (the "scale" parameter in my model, though the name is
unimportant) should be whatever quantity of resource is specified in
the "unit" parameter.
For "percentage" type controls, I'd expect the unit to be 100% ("100pc"
in my syntax).
So, Tony's suggestion looks plausible to me [1]:
| Yes. 255 (or whatever "Q" value is provided in the ACPI table)
| corresponds to no throttling, so 100% bandwidth.
So, if ACPI says Q=387, that's the denominator we advertise.
Does that sound right?
Question: is this a global parameter, or per-CPU?
From the v1.2 RDT spec, it looks like it is a single, global parameter.
I hope this is true (!) But I'm not too familiar with these specs...
Cheers
---Dave
[1] https://lore.kernel.org/lkml/aNq11fmlac6dH4pH@agluck-desk3/
>
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-29 16:37 ` Luck, Tony
@ 2025-09-30 16:02 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-09-30 16:02 UTC (permalink / raw)
To: Luck, Tony
Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Tony,
On Mon, Sep 29, 2025 at 09:37:41AM -0700, Luck, Tony wrote:
> On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote:
> > Hi Tony,
> >
> > Thanks for taking at look at this -- comments below.
> >
> > [...]
> >
> > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> > > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
> >
> > [...]
> >
> > > > Would something like the following work? A read from schemata might
> > > > produce something like this:
> > > >
> > > > MB: 0=50, 1=50
> > > > # MB_HW: 0=32, 1=32
> > > > # MB_MIN: 0=31, 1=31
> > > > # MB_MAX: 0=32, 1=32
> >
> > [...]
> >
> > > > I'd be interested in people's thoughts on it, though.
> > >
> > > Applying this to Intel upcoming region aware memory bandwidth
> > > that supports 255 steps and h/w min/max limits.
> >
> > Following the MPAM example, would you also expect:
> >
> > scale: 255
> > unit: 100pc
> >
> > ...?
>
> Yes. 255 (or whatever "Q" value is provided in the ACPI table)
> corresponds to no throttling, so 100% bandwidth.
>
> >
> > > We would have info files with "min = 1, max = 255" and a schemata
> > > file that looks like this to legacy apps:
[...]
> > > MB: 0=4;1=75
[...]
> > > with the legacy line updated with the rounded value of the MB_HW
> > > supplied by the user. 10/255 = 3.921% ... so call it "4".
> >
> > I'm suggesting that this always be rounded up, so that you have a
> > guarantee that the steps are no smaller than the reported value.
>
> Round up, rather than round-to-nearest, make sense. Though perhaps
> only cosmetic as I would be surprised if anyone has a mix of tools
> looking at the legacy schemata lines while programming using the
> direct h/w controls.
Ack
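As a concrete sketch of the round-up conversion (illustrative Python;
the function name and Q values are just examples -- integer arithmetic
is used to avoid floating-point ceiling surprises):

```python
def hw_to_legacy_pct(hw_value, q):
    """Convert a raw bandwidth control value to the legacy 'MB' percentage.
    q is the scale/denominator: the raw value meaning 100% bandwidth.
    Round up, so the reported percentage is never below the real value."""
    return (hw_value * 100 + q - 1) // q  # integer ceiling of hw*100/q
```

So a raw value of 10 with Q = 255 reads back as "MB: 0=4", as in the
example above.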
[...]
> > Do you have concerns about existing software choking on the #-prefixed
> > lines?
>
> Do they even need a # prefix? We already mix lines for multiple
> resources in the schemata file with a separate prefix for each resource.
> The schemata file also allows writes to just update one resource (or
> one domain in a single resource). The schemata file started with just
> "L3". Then we added "L2", "MB", and "SMBA" with no concern that the
> initial "L3" manipulating tools would be confused.
The "#" thing is for backwards compatibility with old userspace that
might blindly "paste back" unknown entries when writing the schemata
file.
(See also my reply to Reinette [1].)
> > > Then use some other tag naming for other regions. Remote DDR,
> > > local CXL, remote CXL are the ones we think are next in the h/w
> > > memory sequence. But the "region" concept would allow for other
> > > options as other memory technologies come into use.
> >
> > Would it be reasonable just to have a set of these schema instances, per
> > region, so:
> >
> > MB_HW: ... // implicitly region 0
> > MB_HW_1: ...
> > MB_HW_2: ...
>
> Chen Yu is currently looking at putting the word "TIER" into the
> name, since there's some precedent for describing memory in "tiers".
>
> Whatever naming scheme is used, the important part is how will users
> find out what each schemata line actually means/controls.
Agreed. That's a problem, but a separate one.
[...]
> > Or, did you have something else in mind?
> >
> > My thinking is that we avoid adding complexity in the schemata file if
> > we treat mapping these schema instances onto the hardware topology as
> > an orthogonal problem. So long as we have unique names in the schemata
> > file, we can describe elsewhere what they relate to in the hardware.
>
> Yes, exactly this.
OK, that's reassuring.
Cheers
---Dave
[1] https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-30 11:02 ` Chen, Yu C
@ 2025-09-30 16:08 ` Luck, Tony
0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2025-09-30 16:08 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86@kernel.org,
linux-doc@vger.kernel.org, Dave Martin
> >>> This seems to be applicable as it introduces the new interface
> >>> while preserving forward compatibility.
> >>>
> >>> One minor question is that, according to "Figure 6-5. MBA Optimal
> >>> Bandwidth Register" in the latest RDT specification, the maximum
> >>> value ranges from 1 to 511.
> >>> Additionally, this bandwidth field is located at bits 48 to 56 in
> >>> the MBA Optimal Bandwidth Register, and the range for
> >>> this segment could be 1 to 8191. Just wonder if it would be
> >
> > 48..56 is still 9 bits, so max value is 511.
> >
>
> Ah I see, I overlooked this.
>
> >>> possible that the current maximum value of 511 may be extended
> >>> in the future? Perhaps we could explore a method to query the maximum upper
> >>> limit from the ACPI table or register, or use CPUID to distinguish between
> >>> platforms rather than hardcoding it. Reinette also mentioned this in another
> >>> thread.
> >
> > I think 511 was chosen as "bigger than we expect to ever need" and 9-bits
> > allocated in the registers based on that.
> >
>
> OK, got it.
>
> > Initial implementation may use 255 as the maximum - though I'm pushing on
> > that a bit as the throttle graph at the early stage is fairly linear from "1" to some
> > value < 255,
> > when bandwidth hits maximum, then flat up to 255.
> > If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI
> > table should be the value where peak bandwidth is hit
>
> I see. If I understand correctly, the BIOS needs to pre-train the system to
> find this Q. However, if the BIOS cannot provide this Q, would it be
> feasible
> for the user to provide it? For example, the user could saturate the memory
> bandwidth, gradually increase MB_MAX, and finally find the Q_max where the
> memory bandwidth no longer increases. The user could then adjust the max
> field in the info file.
>
> > (though this is complicated
> > because workloads with different mixes of read/write access have different
> > throttle graphs).
> >
>
> Does this mean read and write operations have different Q values to saturate
> the memory bandwidth? For example, if the workload is all reads, there
> is a Q_r;
> if the workload is all writes, there is another Q_w. In that case, maybe we
> could choose the maximum of Q_r and Q_w (max(Q_r, Q_w)).
If the BIOS doesn't provide a good enough number, then users might well
do some tuning based on the workloads they plan to run and ignore the value
in the info file in favor of one tuned specifically for their workloads. But it is too
early to start guessing at workarounds for problems that may not even exist.
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-30 15:55 ` Dave Martin
@ 2025-10-01 12:13 ` Chen, Yu C
2025-10-02 15:40 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Chen, Yu C @ 2025-10-01 12:13 UTC (permalink / raw)
To: Dave Martin
Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, linux-doc
On 9/30/2025 11:55 PM, Dave Martin wrote:
> Hi,
>
> On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote:
>> On 9/29/2025 10:13 PM, Dave Martin wrote:
>
> [...]
>
>>> I guess it would be up to the arch code whether to trust ACPI if it
>>> says that the maximum value of this field is > 511. (> 65535 would be
>>> impossible though, since the fields would start to overlap each
>>> other...)
>>>
>>> Would anything break in the interface proposed here, if the maximum
>>> value is larger than 511? (I'm hoping not. For MPAM, the bandwidth
>>> controls can have up to 16 bits and the size can be probed through MMIO
>>> registers.)
>>>
>>
>> I overlooked this bit width. It should not exceed 511 according to the
>> RDT spec. Previously, I was just wondering how to calculate the legacy
>> MB percentage in Tony's example. If we want to keep consistency - if
>> the user provides a value of 10, what is the denominator: Is it 255,
>> 511, or something queried from ACPI.
>>
>> MB: 0=4;1=75 <--- 10/255
>> #MB_HW: 0=10;1=191
>> #MB_MIN: 0=10;1=191
>> #MB_MAX: 0=64;1=191
>>
>> or
>>
>> MB: 0=1;1=75 <--- 10/511
>> #MB_HW: 0=10;1=191
>> #MB_MIN: 0=10;1=191
>> #MB_MAX: 0=64;1=191
>>
>> thanks,
>> Chenyu
>
> The denominator (the "scale" parameter in my model, though the name is
> unimportant) should be whatever quantity of resource is specified in
> the "unit" parameter.
>
> For "percentage" type controls, I'd expect the unit to be 100% ("100pc"
> in my syntax).
>
> So, Tony's suggestion looks plausible to me [1]:
>
> | Yes. 255 (or whatever "Q" value is provided in the ACPI table)
> | corresponds to no throttling, so 100% bandwidth.
>
> So, if ACPI says Q=387, that's the denominator we advertise.
>
> Does that sound right?
>
Yes, it makes sense, the denominator is the "scale" in your example.
> Question: is this a global parameter, or per-CPU?
>
It should be a global setting for all the MBA Register Blocks.
Thanks,
Chenyu
> From the v1.2 RDT spec, it looks like it is a single, global parameter.
> I hope this is true (!) But I'm not too familiar with these specs...
>
> Cheers
> ---Dave
>
>
> [1] https://lore.kernel.org/lkml/aNq11fmlac6dH4pH@agluck-desk3/
>>
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-01 12:13 ` Chen, Yu C
@ 2025-10-02 15:40 ` Dave Martin
2025-10-02 16:43 ` Luck, Tony
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-10-02 15:40 UTC (permalink / raw)
To: Chen, Yu C
Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, linux-doc
Hi there,
On Wed, Oct 01, 2025 at 08:13:45PM +0800, Chen, Yu C wrote:
> On 9/30/2025 11:55 PM, Dave Martin wrote:
> > Hi,
> >
> > On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote:
[...]
> > > I overlooked this bit width. It should not exceed 511 according to the
> > > RDT spec. Previously, I was just wondering how to calculate the legacy
> > > MB percentage in Tony's example. If we want to keep consistency - if
> > > the user provides a value of 10, what is the denominator: Is it 255,
> > > 511, or something queried from ACPI.
> > >
> > > MB: 0=4;1=75 <--- 10/255
> > > #MB_HW: 0=10;1=191
> > > #MB_MIN: 0=10;1=191
> > > #MB_MAX: 0=64;1=191
> > >
> > > or
> > >
> > > MB: 0=1;1=75 <--- 10/511
> > > #MB_HW: 0=10;1=191
> > > #MB_MIN: 0=10;1=191
> > > #MB_MAX: 0=64;1=191
> > >
> > > thanks,
> > > Chenyu
> >
> > The denominator (the "scale" parameter in my model, though the name is
> > unimportant) should be whatever quantity of resource is specified in
> > the "unit" parameter.
> >
> > For "percentage" type controls, I'd expect the unit to be 100% ("100pc"
> > in my syntax).
> >
> > So, Tony's suggestion looks plausible to me [1]:
> >
> > | Yes. 255 (or whatever "Q" value is provided in the ACPI table)
> > | corresponds to no throttling, so 100% bandwidth.
> >
> > So, if ACPI says Q=387, that's the denominator we advertise.
> >
> > Does that sound right?
> >
>
> Yes, it makes sense, the denominator is the "scale" in your example.
Thanks for confirming that.
> > Question: is this a global parameter, or per-CPU?
> >
>
> It should be a global setting for all the MBA Register Blocks.
That's good -- since resctrl resource controls are not per-CPU,
exposing the exact hardware resolution won't work unless the value
is scaled identically for all CPUs.
Cheers
---Dave
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-02 15:40 ` Dave Martin
@ 2025-10-02 16:43 ` Luck, Tony
0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2025-10-02 16:43 UTC (permalink / raw)
To: Dave Martin, Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86@kernel.org,
linux-doc@vger.kernel.org
> > > So, if ACPI says Q=387, that's the denominator we advertise.
> > >
> > > Does that sound right?
> > >
> >
> > Yes, it makes sense, the denominator is the "scale" in your example.
>
> Thanks for confirming that.
>
> > > Question: is this a global parameter, or per-CPU?
> > >
> >
> > It should be a global setting for all the MBA Register Blocks.
>
> That's good -- since resctrl resource controls are not per-CPU,
> exposing the exact hardware resolution won't work unless the value
> is scaled identically for all CPUs.
The RDT architecture spec says there is a separate MARC table that
describes each instance on an L3 cache.
So in theory there could be different "Q" values for each. I'm chatting
with the architects to point out that would be bad, and they shouldn't
build something that has different "Q" values on the same system.
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-30 15:40 ` Dave Martin
@ 2025-10-10 16:48 ` Reinette Chatre
2025-10-11 17:15 ` Chen, Yu C
2025-10-13 14:36 ` Dave Martin
0 siblings, 2 replies; 52+ messages in thread
From: Reinette Chatre @ 2025-10-10 16:48 UTC (permalink / raw)
To: Dave Martin
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 9/30/25 8:40 AM, Dave Martin wrote:
> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
>> On 9/29/25 6:56 AM, Dave Martin wrote:
>>> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
>>>> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
>>
>> ...
>>
>>>> The region aware h/w supports separate bandwidth controls for each
>>>> region. We could hope (or perhaps update the spec to define) that
>>>> region0 is always node-local DDR memory and keep the "MB" tag for
>>>> that.
>>>
>>> Do you have concerns about existing software choking on the #-prefixed
>>> lines?
>>
>> I am trying to understand the purpose of the #-prefix. I see two motivations
>> for the #-prefix with the primary point that multiple schema apply to the same
>> resource.
>>
>> 1) Commented schema are "inactive"
>> This is unclear to me. In the MB example the commented lines show the
>> finer grained controls. Since the original MB resource is an approximation
>> and the hardware must already be configured to support it, would the #-prefixed
>> lines not show the actual "active" configuration?
>
> They would show the active configuration (possibly more precisely than
> "MB" does).
That is how I see it also. This is specific to MB as we try to maintain
backward compatibility.
If we are going to make user interface changes to resource allocation then
ideally it should consider all known future usage. I am trying to navigate
and understand the discussion on how resctrl can support MPAM and this
RDT region aware requirements.
I scanned the MPAM spec and from what I understand a resource may support
multiple controls at the same time, each with its own properties, and then
there was this:
When multiple partitioning controls are active, each affects the partition’s
bandwidth usage. However, some combinations of controls may not make sense,
because the regulation of that pair of controls cannot be made to work in concert.
resctrl may thus present an "active configuration" that is not a configuration
that "makes sense" ... this may be ok as resctrl would present what hardware
supports combined with what the user requested.
> If not, it's not clear how userspace that is trying to use MB_HW (say)
> could read out the current configuration.
>
> The # is intended to make resctrl ignore the lines when the file
> is written by userspace. This is done so that userspace has to
> actually change those lines in order for them to take effect when
> writing. Old userspace can just pass them through without modification,
> without anything unexpected happening.
Thank you for highlighting this. I did not consider this use case.
>
> The reason why I think that this convention may be needed is that we
> never told (old) userspace what it was supposed to do with schemata
> entries that it does not recognise.
>
>
>> 2) Commented schema are "conflicting"
>> The original proposal mentioned "write them back instead of (or in addition to)
>> the conflicting entries". I do not know how resctrl will be able to
>> handle a user requesting a change to both "MB" and "MB_HW". This seems like
>> something that should fail?
>
> If userspace is asking for two incompatible things at the same time, we
> can either pick one of them and ignore the rest, or do nothing, or fail
> explicitly.
>
> If we think that it doesn't really matter what happens, then resctrl
> could just dumbly process the entries in the order given. If the
> result is not what userspace wanted, that's not our problem.
>
> (Today, nothing prevents userspace writing multiple "MB" lines at the
> same time: resctrl will process them all, but only the final one will
> have a lasting effect. So, the fact that a resctrl write can contain
> mutually incompatible requests does not seem to be new.)
Good point.
>
>
>> On a high level it is not clear to me why the # prefix is needed. As I understand the
>> schemata names will always be unique and the new features made backward
>> compatible to existing schemata names. That is, existing MB, L3, etc.
>> will also have the new info files that describe their values/ranges.
>
> Regarding backwards compatibility for the existing controls:
>
> This proposal is only about numeric controls. L3 wouldn't change, but
> we could still add info/ metadata for bitmap control at the same time
> as adding it for numeric controls.
I think we should. At least we should leave space for such an addition since
it is not obvious to me how multiple resources with different controls or
single resource with multiple controls should be communicated to user space.
To be specific, the original proposal [1] introduced a set of files for
a numeric control and that seems to work for existing and upcoming
schema that need a value in a range. Different controls need different
parameters so to integrate this solution I think it needs another parameter
(presented as a directory, a file, or within a file) that indicates the
type of the control so that user space knows which files/parameters to expect
and how to interpret them.
Since different controls have different parameters we need to consider
whether it is easier to create/parse unique files for each control or
present all the parameters within one file with another file noting the type
of control.
I understand the files/parameters are intended to be in the schema's info directory
but how this will look is not obvious to me. Part of the MPAM refactoring transitioned
the top level info directories to represent the schema entries that currently reflect
the resources. When we start having multiple schema entries (multiple controls) for a
single resource the simplest implementation may result in a top level info
directory for every schema entry ... but the expectation is that these top
level directories should be per resource, no?
At this time I am envisioning the proposal to result in something like below, where
there is one resource directory and one directory per schema entry, with a
"schema_type" file (added by me) to help the user find out which schema type it is and
thus which files to expect:
MB
├── bandwidth_gran
├── delay_linear
├── MB
│ ├── map
│ ├── max
│ ├── min
│ ├── scale
│ ├── schema_type
│ └── unit
├── MB_HW
│ ├── map
│ ├── max
│ ├── min
│ ├── scale
│ ├── schema_type
│ └── unit
├── MB_MAX
│ └── tbd
├── MB_MIN
│ └── tbd
├── min_bandwidth
├── num_closids
└── thread_throttle_mode
Something else related to control that caught my eye in MPAM spec is this gem:
MPAM provides discoverable vendor extensions to permit partners
to invent partitioning controls.
> MB may be hard to describe in a useful way, though -- at least in the
> MPAM case, where the number of steps does not divide into 100, and the
> AMD cases where the meaning of the MB control values is different.
Above I do assume that MB would be represented in a new interface since it
is a schema entry, if that causes trouble then we could drop it.
>
> MB and MB_HW are not interchangeable. To obtain predictable results
> from MB, userspace would need to know precisely how the kernel is going
> to round the value. This feels like an implementation detail that
> doesn't belong in the ABI.
ack
...
> Anyway, going back to the "#" convention:
>
> If the initial read of schemata has the new entries "pre-commented",
> then userspace wouldn't need to know about the new entries. It could
> just tweak the MB entry (which it knows about), and write the file back:
>
> MB: 0=43
> # MB_HW: 0=2
> # MB_MIN: 0=1
> # MB_MAX: 0=2
>
> then resctrl knows to ignore the hashed lines, and so reading the file
> back gives:
>
> MB: 0=43
> # MB_HW: 0=3
> # MB_MIN: 0=2
> # MB_MAX: 0=3
Thank you for the example. This seems reasonable. I would like to go back
to what you wrote in [1]:
> Software that understands the new entries can uncomment the conflicting
> entries and write them back instead of (or in addition to) the
> conflicting entries. For example, userspace might write the following:
>
> MB_MIN: 0=16, 1=16
> MB_MAX: 0=32, 1=32
>
> Which might then read back as follows:
>
> MB: 0=50, 1=50
> # MB_HW: 0=32, 1=32
> # MB_MIN: 0=16, 1=16
> # MB_MAX: 0=32, 1=32
Could/should resctrl uncomment the lines after userspace modified them?
>
> (For hardware-specific reasons, the MPAM driver currently internally
> programs the MIN bound to be a bit less than the MAX bound, when
> userspace writes an "MB" entry into schemata. The key thing is that
> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
> > resctrl level, I don't think that we necessarily need to make promises
> about what they can change _to_. The exact effect of MIN and MAX
> bounds is likely to be hardware-dependent anyway.)
MPAM has the "HARDLIM" distinction associated with these MAX values
and from what I can tell this is per PARTID. Is this something that needs
to be supported? To do this resctrl will need to support modifying
control properties per resource group.
>
>
> Regarding new userspace:
>
> Going forward, we can explicitly document that there should be no
> conflicting or "passenger" entries in a schemata write: don't include
> an entry for something that you don't explicitly want to set, and if
> multiple entries affect the same resource, we don't promise what
> happens.
>
> (But sadly, we can't impose that rule on existing software after the
> fact.)
It may thus not be worth it to make such a rule.
>
>
> One final note: I have not provided any way to indicate that all those
> entries control the same hardware resource. The common "MB" prefix is
> intended as a clue, but ultimately, userspace needs to know what an
> entry controls before tweaking it.
>
> We could try to describe the relationships explicitly, but I'm not sure
> that it is useful...
What other relationships should we consider for MPAM? I see that MPAM
allows per-PARTID configurations for secure/non-secure, physical/virtual,
...? Is it expected that MPAM's support of these should be exposed via resctrl?
Have you considered how to express if user wants hardware to have different
allocations for, for example, same PARTID at different execution levels?
Reinette
[1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-10 16:48 ` Reinette Chatre
@ 2025-10-11 17:15 ` Chen, Yu C
2025-10-13 15:01 ` Dave Martin
2025-10-13 14:36 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Chen, Yu C @ 2025-10-11 17:15 UTC (permalink / raw)
To: Reinette Chatre, Dave Martin
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
On 10/11/2025 12:48 AM, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/30/25 8:40 AM, Dave Martin wrote:
>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
>>> On 9/29/25 6:56 AM, Dave Martin wrote:
>>>> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
>>>>> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
[snip]
>> Anyway, going back to the "#" convention:
>>
>> If the initial read of schemata has the new entries "pre-commented",
>> then userspace wouldn't need to know about the new entries. It could
>> just tweak the MB entry (which it knows about), and write the file back:
>>
>> MB: 0=43
>> # MB_HW: 0=2
>> # MB_MIN: 0=1
>> # MB_MAX: 0=2
>>
>> then resctrl knows to ignore the hashed lines, and so reading the file
>> back gives:
>>
>> MB: 0=43
>> # MB_HW: 0=3
>> # MB_MIN: 0=2
>> # MB_MAX: 0=3
May I ask if introducing the pre-commented lines is intended to prevent
control conflicts over the same MBA? If this is the case, I wonder if,
instead of exposing both MB and the pre-commented MB_HW_X in one file,
it would be feasible to introduce a new mount option (such as "hw") to
make the legacy MB and MB_HW_X mutually exclusive. If the user specifies
"hw" in mount option, display MB_HW_X (if available); otherwise, display
only the legacy "MB". This is similar to the cpufreq governor, where only
one governor is allowed to adjust the CPU frequency at a time.
thanks,
Chenyu
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-10 16:48 ` Reinette Chatre
2025-10-11 17:15 ` Chen, Yu C
@ 2025-10-13 14:36 ` Dave Martin
2025-10-14 22:55 ` Reinette Chatre
1 sibling, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-10-13 14:36 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 9/30/25 8:40 AM, Dave Martin wrote:
> > On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
> >> On 9/29/25 6:56 AM, Dave Martin wrote:
[...]
> >> 1) Commented schema are "inactive"
> >> This is unclear to me. In the MB example the commented lines show the
> >> finer grained controls. Since the original MB resource is an approximation
> >> and the hardware must already be configured to support it, would the #-prefixed
> >> lines not show the actual "active" configuration?
> >
> > They would show the active configuration (possibly more precisely than
> > "MB" does).
>
> That is how I see it also. This is specific to MB as we try to maintain
> backward compatibility.
>
> If we are going to make user interface changes to resource allocation then
> ideally it should consider all known future usage. I am trying to navigate
> and understand the discussion on how resctrl can support MPAM and this
> RDT region aware requirements.
>
> I scanned the MPAM spec and from what I understand a resource may support
> multiple controls at the same time, each with its own properties, and then
> there was this:
>
> When multiple partitioning controls are active, each affects the partition’s
> bandwidth usage. However, some combinations of controls may not make sense,
> because the regulation of that pair of controls cannot be made to work in concert.
>
> resctrl may thus present an "active configuration" that is not a configuration
> that "makes sense" ... this may be ok as resctrl would present what hardware
> supports combined with what user requested.
This is analogous to what the MPAM spec says, though if resctrl offers
two different schemata for the same hardware control, the control cannot be
configured with both values simultaneously.
For the MPAM hardware controls affecting the same hardware resource,
they can be programmed to combinations of values that have no sensible
interpretation, and the values can be read back just fine. The
performance effects may not be what the user expected / wanted, but
this is not directly visible to resctrl.
So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user
can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that
will read back just as programmed. The architecture does not promise
what the performance effect of this will be, but resctrl does not need
to care.
[...]
> > The # is intended to make resctrl ignore the lines when the file
> > is written by userspace. This is done so that userspace has to
> > actually change those lines in order for them to take effect when
> > writing. Old userspace can just pass them through without modification,
> > without anything unexpected happening.
>
> Thank you for highlighting this. I did not consider this use case.
[...]
> >> 2) Commented schema are "conflicting"
> >> The original proposal mentioned "write them back instead of (or in addition to)
> >> the conflicting entries". I do not know how resctrl will be able to
> >> handle a user requesting a change to both "MB" and "MB_HW". This seems like
> >> something that should fail?
> >
> > If userspace is asking for two incompatible things at the same time, we
> > can either pick one of them and ignore the rest, or do nothing, or fail
> > explicitly.
> >
> > If we think that it doesn't really matter what happens, then resctrl
> > could just dumbly process the entries in the order given. If the
> > result is not what userspace wanted, that's not our problem.
> >
> > (Today, nothing prevents userspace writing multiple "MB" lines at the
> > same time: resctrl will process them all, but only the final one will
> > have a lasting effect. So, the fact that a resctrl write can contain
> > mutually incompatible requests does not seem to be new.)
>
> Good point.
[...]
> > This proposal is only about numeric controls. L3 wouldn't change, but
> > we could still add info/ metadata for bitmap control at the same time
> > as adding it for numeric controls.
>
> I think we should. At least we should leave space for such an addition since
> it is not obvious to me how multiple resources with different controls or
> single resource with multiple controls should be communicated to user space.
>
> To be specific, the original proposal [1] introduced a set of files for
> a numeric control and that seems to work for existing and upcoming
> schema that need a value in a range. Different controls need different
> parameters so to integrate this solution I think it needs another parameter
> (presented as a directory, a file, or within a file) that indicates the
> type of the control so that user space knows which files/parameters to expect
> and how to interpret them.
Agreed. I wasn't meaning to imply that this proposal shouldn't be
integrated into something more general. If we want a richer
description than the current one, it makes sense to incorporate bitmap
controls -- this just wasn't my focus.
> Since different controls have different parameters we need to consider
> whether it is easier to create/parse unique files for each control or
> present all the parameters within one file with another file noting the type
> of control.
Separate files work quite well for low-tech tooling built using shell
scripts, and this seems to follow the sysfs philosophy. Since there is
no need to keep re-reading these parameters, simplicity feels more
important than efficiency?
But we could equally have a single file with multiple pieces of
information in it.
I don't have a strong view on this.
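For illustration only, a sketch of what the low-tech shell consumption might
look like -- the directory layout and file names here are assumptions taken
from this discussion, not an existing interface (a temporary directory stands
in for the real info/ hierarchy):

```shell
# Hypothetical sketch: per-control parameters exposed as separate
# one-value-per-file entries, read back with plain cat. All paths
# and names are illustrative, not ABI.
info=$(mktemp -d)
mkdir -p "$info/MB/MB_HW"
echo 1  > "$info/MB/MB_HW/min"
echo 63 > "$info/MB/MB_HW/max"

min=$(cat "$info/MB/MB_HW/min")
max=$(cat "$info/MB/MB_HW/max")
echo "MB_HW accepts values $min..$max"

rm -rf "$info"
```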
> I understand the files/parameters are intended to be in the schema's info directory
> but how this will look is not obvious to me. Part of the MPAM refactoring transitioned
> the top level info directories to represent the schema entries that currently reflect
> the resources. When we start having multiple schema entries (multiple controls) for a
> single resource the simplest implementation may result in a top level info
> directory for every schema entry ... but the expectation is that these top
> level directories should be per resource, no?
I had not considered that the info/ directories correspond to resources,
not individual schemata...
>
> At this time I am envisioning the proposal to result in something like below where
> there is one resource directory and one directory per schema entry with a (added by me)
> "schema_type" file to help user find out what the schema type is to know which files are present:
>
> MB
> ├── bandwidth_gran
> ├── delay_linear
> ├── MB
> │ ├── map
> │ ├── max
> │ ├── min
> │ ├── scale
> │ ├── schema_type
> │ └── unit
> ├── MB_HW
> │ ├── map
> │ ├── max
> │ ├── min
> │ ├── scale
> │ ├── schema_type
> │ └── unit
> ├── MB_MAX
> │ └── tbd
> ├── MB_MIN
> │ └── tbd
> ├── min_bandwidth
> ├── num_closids
> └── thread_throttle_mode
I see no reason not to do that. Either way, older userspace just
ignores the new files and directories.
Perhaps add an intermediate subdirectory to clarify the relationship
between the resource dir and the individual schema descriptions?
This may also avoid the new descriptions getting mixed up with the old
description files.
Say,
info
├── MB
│ ├── resource_schemata
│ │ ├── MB
│ │ │ ├── map
│ │ │ ├── max
│ ┆ │ ├── min
│ │ ┆
┆ │
├── MB_HW
│ ├── map
│ ┆
┆
> Something else related to control that caught my eye in MPAM spec is this gem:
> MPAM provides discoverable vendor extensions to permit partners
> to invent partitioning controls.
Yup.
Since we have no way to know what vendor-specific controls look like or
what they mean, we can't do much about this.
So, it's the vendor's job to implement support for it, and we might
still say no (if there is no sane way to integrate it).
> > MB may be hard to describe in a useful way, though -- at least in the
> > MPAM case, where the number of steps does not divide into 100, and the
> > AMD cases where the meaning of the MB control values is different.
>
> Above I do assume that MB would be represented in a new interface since it
> is a schema entry, if that causes trouble then we could drop it.
Since MB is described by the existing files and the documentation,
perhaps it doesn't need an additional description.
Alternatively though, could we just have a special schema_type for this,
and omit the other properties? This would mean that we at least have
an entry for every schema.
>
> >
> > MB and MB_HW are not interchangeable. To obtain predictable results
> > from MB, userspace would need to know precisely how the kernel is going
> > to round the value. This feels like an implementation detail that
> > doesn't belong in the ABI.
>
> ack
>
> ...
>
> > Anyway, going back to the "#" convention:
> >
> > If the initial read of schemata has the new entries "pre-commented",
> > then userspace wouldn't need to know about the new entries. It could
> > just tweak the MB entry (which it knows about), and write the file back:
> >
> > MB: 0=43
> > # MB_HW: 0=2
> > # MB_MIN: 0=1
> > # MB_MAX: 0=2
> >
> > then resctrl knows to ignore the hashed lines, and so reading the file
> > back gives:
> >
> > MB: 0=43
> > # MB_HW: 0=3
> > # MB_MIN: 0=2
> > # MB_MAX: 0=3
>
> Thank you for the example. This seems reasonable. I would like to go back
> to what you wrote in [1]:
>
> > Software that understands the new entries can uncomment the conflicting
> > entries and write them back instead of (or in addition to) the
> > conflicting entries. For example, userspace might write the following:
> >
> > MB_MIN: 0=16, 1=16
> > MB_MAX: 0=32, 1=32
> >
> > Which might then read back as follows:
> >
> > MB: 0=50, 1=50
> > # MB_HW: 0=32, 1=32
> > # MB_MIN: 0=16, 1=16
> > # MB_MAX: 0=32, 1=32
>
> Could/should resctrl uncomment the lines after userspace modified them?
The '#' wasn't meant to be a state that gets turned on and off.
Rather, userspace would use this to indicate which entries are
intentionally being modified.
So long as the entries affecting a single resource are ordered so that
each entry is strictly more specific than the previous entries (as
illustrated above), then reading schemata and stripping all the hashes
would allow a previous configuration to be restored; to change just one
entry, userspace can uncomment just that one, or write only that entry
(which is what I think we should recommend for new software).
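As a sketch of the restore case (entry names and values below are made-up
examples, not an existing ABI), a script could simply strip the leading "#"
markers from a saved snapshot so that every entry is treated as an
intentional write:

```shell
# Illustrative only: turn a saved schemata snapshot back into a full
# write by dropping the "# " comment markers. The entries shown are
# hypothetical.
saved='MB: 0=43
# MB_HW: 0=3
# MB_MIN: 0=2
# MB_MAX: 0=3'

# sed removes the marker; the result could then be written back to the
# target group's schemata file.
printf '%s\n' "$saved" | sed 's/^# //'
```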
> > (For hardware-specific reasons, the MPAM driver currently internally
> > programs the MIN bound to be a bit less than the MAX bound, when
> > userspace writes an "MB" entry into schemata. The key thing is that
> > writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
> resctrl level, I don't think that we necessarily need to make promises
> > about what they can change _to_. The exact effect of MIN and MAX
> > bounds is likely to be hardware-dependent anyway.)
>
> MPAM has the "HARDLIM" distinction associated with these MAX values
> and from what I can tell this is per PARTID. Is this something that needs
> to be supported? To do this resctrl will need to support modifying
> control properties per resource group.
Possibly. Since this is a boolean control that determines how the
MBW_MAX control is applied, we could perhaps present it as an
additional schema -- if so, it's basically orthogonal.
| MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
or
| MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
Does this look reasonable?
I don't know whether we have a clear use case for this today, and like
almost everything else in MPAM, implementing it is optional...
> > Regarding new userspace:
> >
> > Going forward, we can explicitly document that there should be no
> > conflicting or "passenger" entries in a schemata write: don't include
> > an entry for something that you don't explicitly want to set, and if
> > multiple entries affect the same resource, we don't promise what
> > happens.
> >
> > (But sadly, we can't impose that rule on existing software after the
> > fact.)
>
> It may thus not be worth it to make such a rule.
Ack. Perhaps we could recommend it, though.
(At the very least, avoiding writing redundant entries would be a
little more efficient for the user.)
> >
> >
> > One final note: I have not provided any way to indicate that all those
> > entries control the same hardware resource. The common "MB" prefix is
> > intended as a clue, but ultimately, userspace needs to know what an
> > entry controls before tweaking it.
> >
> > We could try to describe the relationships explicitly, but I'm not sure
> > that it is useful...
>
> What other relationships should we consider for MPAM? I see that each
> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual,
> ... ? Is it expected that MPAM's support of these should be exposed via resctrl?
Probably not. These are best regarded as entirely separate instances
of MPAM; the PARTID spaces are separate. The Non-secure physical
address space is the only physical address space directly accessible to
Linux -- for the others, we can't address the MMIO registers anyway.
For now, the other address spaces are the firmware's problem.
> Have you considered how to express if user wants hardware to have different
> allocations for, for example, same PARTID at different execution levels?
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
MPAM doesn't allow different controls for a PARTID depending on the
exception level, but it is possible to program different PARTIDs for
hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0).
I think that if we wanted to go down that road, we would want to expose
additional "task IDs" in resctrlfs that can be placed into groups
independently, say
echo 14161:kernel >>.../some_group/tasks
echo 14161:user >>.../other_group/tasks
However, inside the kernel, the boundary between work done on behalf of
a specific userspace task, work done on behalf of userspace in general,
and autonomous work inside the kernel is fuzzy and not well defined.
For this reason, we currently only configure the PARTID for EL0. For
EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0).
Hopefully this is orthogonal to the discussion of schema descriptions,
though ...?
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-11 17:15 ` Chen, Yu C
@ 2025-10-13 15:01 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-10-13 15:01 UTC (permalink / raw)
To: Chen, Yu C
Cc: Reinette Chatre, Luck, Tony, linux-kernel, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86, linux-doc
Hi,
On Sun, Oct 12, 2025 at 01:15:07AM +0800, Chen, Yu C wrote:
> On 10/11/2025 12:48 AM, Reinette Chatre wrote:
> > Hi Dave,
> >
> > On 9/30/25 8:40 AM, Dave Martin wrote:
> > > On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
> > > > On 9/29/25 6:56 AM, Dave Martin wrote:
> > > > > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote:
> > > > > > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote:
>
> [snip]
>
> > > Anyway, going back to the "#" convention:
> > >
> > > If the initial read of schemata has the new entries "pre-commented",
> > > then userspace wouldn't need to know about the new entries. It could
> > > just tweak the MB entry (which it knows about), and write the file back:
> > >
> > > MB: 0=43
> > > # MB_HW: 0=2
> > > # MB_MIN: 0=1
> > > # MB_MAX: 0=2
> > >
> > > then resctrl knows to ignore the hashed lines, and so reading the file
> > > back gives:
> > >
> > > MB: 0=43
> > > # MB_HW: 0=3
> > > # MB_MIN: 0=2
> > > # MB_MAX: 0=3
>
>
> May I ask if introducing the pre-commented lines is intended to prevent
> control conflicts over the same MBA? If this is the case, I wonder if,
Basically, yes. Note, this is only an issue for old software that
doesn't understand the new entries. New software should omit the
entries for resources that it doesn't understand or doesn't want to set.
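A minimal sketch of that filtering (the entry names and values are
illustrative, and the list of understood entries is an assumption of the
example, not anything resctrl defines):

```shell
# Hypothetical sketch: before writing schemata back, keep only the
# entries this tool understands, dropping legacy and commented lines.
known='MB_MIN MB_MAX'
input='MB: 0=43
# MB_HW: 0=3
MB_MIN: 0=16
MB_MAX: 0=32'

printf '%s\n' "$input" | while IFS= read -r line; do
    entry=${line%%:*}
    case " $known " in
        *" $entry "*) printf '%s\n' "$line" ;;  # keep entries we understand
    esac
done
```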
> instead of exposing both MB and the pre-commented MB_HW_X in one file,
> it would be feasible to introduce a new mount option (such as "hw") to
> make the legacy MB and MB_HW_X mutually exclusive. If the user specifies
> "hw" in mount option, display MB_HW_X (if available); otherwise, display
> only the legacy "MB". This is similar to the cpufreq governor, where only
> one governor is allowed to adjust the CPU frequency at a time.
>
> thanks,
> Chenyu
This could be done with a mount option, but I am concerned that a
single resctrl mount might be used by different tools while it is
mounted. So, maybe a global control is not what we want?
I'm hoping that we can design this in a way that is sufficiently
backwards compatible that we don't need a mount option to turn it off.
Can you think of a userspace scenario that would break, with the
proposed design?
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-13 14:36 ` Dave Martin
@ 2025-10-14 22:55 ` Reinette Chatre
2025-10-15 15:47 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-10-14 22:55 UTC (permalink / raw)
To: Dave Martin
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 10/13/25 7:36 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 9/30/25 8:40 AM, Dave Martin wrote:
>>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
>>>> On 9/29/25 6:56 AM, Dave Martin wrote:
>
> [...]
>
>>>> 1) Commented schema are "inactive"
>>>> This is unclear to me. In the MB example the commented lines show the
>>>> finer grained controls. Since the original MB resource is an approximation
>>>> and the hardware must already be configured to support it, would the #-prefixed
>>>> lines not show the actual "active" configuration?
>>>
>>> They would show the active configuration (possibly more precisely than
>>> "MB" does).
>>
>> That is how I see it also. This is specific to MB as we try to maintain
>> backward compatibility.
>>
>> If we are going to make user interface changes to resource allocation then
>> ideally it should consider all known future usage. I am trying to navigate
>> and understand the discussion on how resctrl can support MPAM and this
>> RDT region aware requirements.
>>
>> I scanned the MPAM spec and from what I understand a resource may support
>> multiple controls at the same time, each with its own properties, and then
>> there was this:
>>
>> When multiple partitioning controls are active, each affects the partition’s
>> bandwidth usage. However, some combinations of controls may not make sense,
>> because the regulation of that pair of controls cannot be made to work in concert.
>>
>> resctrl may thus present an "active configuration" that is not a configuration
>> that "makes sense" ... this may be ok as resctrl would present what hardware
>> supports combined with what user requested.
>
> This is analogous to what the MPAM spec says, though if resctrl offers
> two different schemata for the same hardware control, the control cannot be
> configured with both values simultaneously.
>
> For the MPAM hardware controls affecting the same hardware resource,
> they can be programmed to combinations of values that have no sensible
> interpretation, and the values can be read back just fine. The
> performance effects may not be what the user expected / wanted, but
> this is not directly visible to resctrl.
>
> So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user
> can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that
> will read back just as programmed. The architecture does not promise
> what the performance effect of this will be, but resctrl does not need
> to care.
The same appears to be true for Intel RDT where the spec warns ("Undesirable
and undefined performance effects may result if cap programming guidelines
are not followed.") but does not seem to prevent such configurations.
...
>>> This proposal is only about numeric controls. L3 wouldn't change, but
>>> we could still add info/ metadata for bitmap control at the same time
>>> as adding it for numeric controls.
>>
>> I think we should. At least we should leave space for such an addition since
>> it is not obvious to me how multiple resources with different controls or
>> single resource with multiple controls should be communicated to user space.
>>
>> To be specific, the original proposal [1] introduced a set of files for
>> a numeric control and that seems to work for existing and upcoming
>> schema that need a value in a range. Different controls need different
>> parameters so to integrate this solution I think it needs another parameter
>> (presented as a directory, a file, or within a file) that indicates the
>> type of the control so that user space knows which files/parameters to expect
>> and how to interpret them.
>
> Agreed. I wasn't meaning to imply that this proposal shouldn't be
> integrated into something more general. If we want a richer
> description than the current one, it makes sense to incorporate bitmap
> controls -- this just wasn't my focus.
Understood.
>
>> Since different controls have different parameters we need to consider
>> whether it is easier to create/parse unique files for each control or
>> present all the parameters within one file with another file noting the type
>> of control.
>
> Separate files work quite well for low-tech tooling built using shell
> scripts, and this seems to follow the sysfs philosophy. Since there is
> no need to keep re-reading these parameters, simplicity feels more
> important than efficiency?
>
> But we could equally have a single file with multiple pieces of
> information in it.
>
> I don't have a strong view on this.
If by sysfs philosophy you mean "one value per file" then resctrl departed from that from
the beginning (with the schemata file). I am also not advocating for one or the other
at this time but believe we have some flexibility when faced with implementation
options/challenges.
>
>> I understand the files/parameters are intended to be in the schema's info directory
>> but how this will look is not obvious to me. Part of the MPAM refactoring transitioned
>> the top level info directories to represent the schema entries that currently reflect
>> the resources. When we start having multiple schema entries (multiple controls) for a
>> single resource the simplest implementation may result in a top level info
>> directory for every schema entry ... but the expectation is that these top
>> level directories should be per resource, no?
>
> I had not considered that the info/ directories correspond to resources,
> not individual schemata...
>>
>> At this time I am envisioning the proposal to result in something like below where
>> there is one resource directory and one directory per schema entry with a (added by me)
>> "schema_type" file to help user find out what the schema type is to know which files are present:
>>
>> MB
>> ├── bandwidth_gran
>> ├── delay_linear
>> ├── MB
>> │ ├── map
>> │ ├── max
>> │ ├── min
>> │ ├── scale
>> │ ├── schema_type
>> │ └── unit
>> ├── MB_HW
>> │ ├── map
>> │ ├── max
>> │ ├── min
>> │ ├── scale
>> │ ├── schema_type
>> │ └── unit
>> ├── MB_MAX
>> │ └── tbd
>> ├── MB_MIN
>> │ └── tbd
>> ├── min_bandwidth
>> ├── num_closids
>> └── thread_throttle_mode
>
> I see no reason not to do that. Either way, older userspace just
> ignores the new files and directories.
>
> Perhaps add an intermediate subdirectory to clarify the relationship
> between the resource dir and the individual schema descriptions?
>
> This may also avoid the new descriptions getting mixed up with the old
> description files.
>
> Say,
>
> info
> ├── MB
> │ ├── resource_schemata
> │ │ ├── MB
> │ │ │ ├── map
> │ │ │ ├── max
> │ ┆ │ ├── min
> │ │ ┆
> ┆ │
> ├── MB_HW
> │ ├── map
> │ ┆
> ┆
Looks good to me.
>
>> Something else related to control that caught my eye in MPAM spec is this gem:
>> MPAM provides discoverable vendor extensions to permit partners
>> to invent partitioning controls.
>
> Yup.
>
> Since we have no way to know what vendor-specific controls look like or
> what they mean, we can't do much about this.
>
> So, it's the vendor's job to implement support for it, and we might
> still say no (if there is no sane way to integrate it).
ack.
>
>>> MB may be hard to describe in a useful way, though -- at least in the
>>> MPAM case, where the number of steps does not divide into 100, and the
>>> AMD cases where the meaning of the MB control values is different.
>>
>> Above I do assume that MB would be represented in a new interface since it
>> is a schema entry, if that causes trouble then we could drop it.
>
> Since MB is described by the existing files and the documentation,
> perhaps it doesn't need an additional description.
>
> Alternatively though, could we just have a special schema_type for this,
> and omit the other properties? This would mean that we at least have
> an entry for every schema.
We could do this, yes.
>>> MB and MB_HW are not interchangeable. To obtain predictable results
>>> from MB, userspace would need to know precisely how the kernel is going
>>> to round the value. This feels like an implementation detail that
>>> doesn't belong in the ABI.
>>
>> ack
>>
>> ...
>>
>>> Anyway, going back to the "#" convention:
>>>
>>> If the initial read of schemata has the new entries "pre-commented",
>>> then userspace wouldn't need to know about the new entries. It could
>>> just tweak the MB entry (which it knows about), and write the file back:
>>>
>>> MB: 0=43
>>> # MB_HW: 0=2
>>> # MB_MIN: 0=1
>>> # MB_MAX: 0=2
>>>
>>> then resctrl knows to ignore the hashed lines, and so reading the file
>>> back gives:
>>>
>>> MB: 0=43
>>> # MB_HW: 0=3
>>> # MB_MIN: 0=2
>>> # MB_MAX: 0=3
>>
>> Thank you for the example. This seems reasonable. I would like to go back
>> to what you wrote in [1]:
>>
>>> Software that understands the new entries can uncomment the conflicting
>>> entries and write them back instead of (or in addition to) the
>>> conflicting entries. For example, userspace might write the following:
>>>
>>> MB_MIN: 0=16, 1=16
>>> MB_MAX: 0=32, 1=32
>>>
>>> Which might then read back as follows:
>>>
>>> MB: 0=50, 1=50
>>> # MB_HW: 0=32, 1=32
>>> # MB_MIN: 0=16, 1=16
>>> # MB_MAX: 0=32, 1=32
>>
>> Could/should resctrl uncomment the lines after userspace modified them?
>
> The '#' wasn't meant to be a state that gets turned on and off.
Thank you for clarifying.
> Rather, userspace would use this to indicate which entries are
> intentionally being modified.
I see. I assume that we should not see many of these '#' entries and expect
the ones we do see to shadow the legacy schemata entries. New schemata entries
(that do not shadow legacy ones) should not have the '#' prefix even if
their initial support does not include all controls.
> So long as the entries affecting a single resource are ordered so that
> each entry is strictly more specific than the previous entries (as
> illustrated above), then reading schemata and stripping all the hashes
> would allow a previous configuration to be restored; to change just one
> entry, userspace can uncomment just that one, or write only that entry
> (which is what I think we should recommend for new software).
This is a good rule of thumb.
>
>>> (For hardware-specific reasons, the MPAM driver currently internally
>>> programs the MIN bound to be a bit less than the MAX bound, when
>>> userspace writes an "MB" entry into schemata. The key thing is that
>>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
>>> resctrl level, I don't think that we necessarily need to make promises
>>> about what they can change _to_. The exact effect of MIN and MAX
>>> bounds is likely to be hardware-dependent anyway.)
>>
>> MPAM has the "HARDLIM" distinction associated with these MAX values
>> and from what I can tell this is per PARTID. Is this something that needs
>> to be supported? To do this resctrl will need to support modifying
>> control properties per resource group.
>
> Possibly. Since this is a boolean control that determines how the
> MBW_MAX control is applied, we could perhaps present it as an
> additional schema -- if so, it's basically orthogonal.
>
> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>
> or
>
> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>
> Does this look reasonable?
It does.
>
> I don't know whether we have a clear use case for this today, and like
> almost everything else in MPAM, implementing it is optional...
>
>
>>> Regarding new userspace:
>>>
>>> Going forward, we can explicitly document that there should be no
>>> conflicting or "passenger" entries in a schemata write: don't include
>>> an entry for something that you don't explicitly want to set, and if
>>> multiple entries affect the same resource, we don't promise what
>>> happens.
>>>
>>> (But sadly, we can't impose that rule on existing software after the
>>> fact.)
>>
>> It may thus not be worth it to make such a rule.
>
> Ack. Perhaps we could recommend it, though.
We could, yes.
>
> (At the very least, avoiding writing redundant entries would be a
> little more efficient for the user.)
>
>>>
>>>
>>> One final note: I have not provided any way to indicate that all those
>>> entries control the same hardware resource. The common "MB" prefix is
>>> intended as a clue, but ultimately, userspace needs to know what an
>>> entry controls before tweaking it.
>>>
>>> We could try to describe the relationships explicitly, but I'm not sure
>>> that it is useful...
>>
>> What other relationships should we consider for MPAM? I see that each
>> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual,
>> ... ? Is it expected that MPAM's support of these should be exposed via resctrl?
>
> Probably not. These are best regarded as entirely separate instances
> of MPAM; the PARTID spaces are separate. The Non-secure physical
> address space is the only physical address space directly accessible to
> Linux -- for the others, we can't address the MMIO registers anyway.
>
> For now, the other address spaces are the firmware's problem.
Thank you.
>
>> Have you considered how to express if user wants hardware to have different
>> allocations for, for example, same PARTID at different execution levels?
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
>
> MPAM doesn't allow different controls for a PARTID depending on the
> exception level, but it is possible to program different PARTIDs for
> hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0).
I misunderstood this from the spec. Thank you for clarifying.
>
> I think that if we wanted to go down that road, we would want to expose
> additional "task IDs" in resctrlfs that can be placed into groups
> independently, say
>
> echo 14161:kernel >>.../some_group/tasks
> echo 14161:user >>.../other_group/tasks
>
> However, inside the kernel, the boundary between work done on behalf of
> a specific userspace task, work done on behalf of userspace in general,
> and autonomous work inside the kernel is fuzzy and not well defined.
>
> For this reason, we currently only configure the PARTID for EL0. For
> EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0).
>
> Hopefully this is orthogonal to the discussion of schema descriptions,
> though ...?
Yes.
Reinette
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-09-22 14:39 ` Dave Martin
2025-09-23 17:27 ` Reinette Chatre
@ 2025-10-15 15:18 ` Dave Martin
2025-10-16 15:57 ` Reinette Chatre
1 sibling, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-10-15 15:18 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
Just following up on the skipped L2_NONCONT_CAT test -- see below.
[...]
On Mon, Sep 22, 2025 at 03:39:47PM +0100, Dave Martin wrote:
[...]
> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
[...]
> > On 9/2/25 9:24 AM, Dave Martin wrote:
[...]
> > > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> > > the other tests except for the NONCONT_CAT tests, which do not seem to
> > > be supported in my configuration -- and have nothing to do with the
> > > code touched by this patch).
> >
> > Is the NONCONT_CAT test failing (i.e. printing "not ok")?
> >
> > The NONCONT_CAT tests may print error messages as debug information as part of
> > running, but these errors are expected as part of the test. The test should accurately
> > state whether it passed or failed though. For example, below attempts to write
> > a non-contiguous CBM to a system that does not support non-contiguous masks.
> > This fails as expected, the error messages are printed as debugging output, and
> > thus the test passes with an "ok".
> >
> > # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
> > # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
> > ok 5 L3_NONCONT_CAT: test
>
> I don't think that this was anything to do with my changes, but I no
> longer seem to have the test output.  (Since this test has to do with
> bitmap schemata (?), it seemed unlikely to be affected by changes to
> bw_validate().)
>
> I'll need to re-test with and without this patch to check whether it
> makes any difference.
I finally got around to testing this on top of -rc1.
Disregarding trivial differences, the patched version (+++) doesn't
seem to introduce any regressions over the vanilla version (---)
(below). (The CMT test actually failed with an out-of-tolerance result
on the vanilla kernel only. Possibly there was some adverse system
load interfering.)
Looking at the code, it seems that L2_NONCONT_CAT is not gated by any
config or mount option. I think this is just a feature that my
hardware doesn't support (?)
arch/x86/kernel/cpu/resctrl/core.c has:
| static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
| {
[...]
| if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
| r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
[...]
| }
Cheers
---Dave
Full diff of the test output:
--- base/resctrl_tests_6.18.0-rc1.out 2025-10-14 17:11:56.000000000 +0100
+++ test1/resctrl_tests_6.18.0-rc1-test1.out 2025-10-14 17:21:44.000000000 +0100
@@ -1,132 +1,132 @@
TAP version 13
# Pass: Check kernel supports resctrl filesystem
# Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists
# resctrl filesystem not mounted
-# dmesg: [ 1.409003] resctrl: L3 allocation detected
-# dmesg: [ 1.409040] resctrl: MB allocation detected
-# dmesg: [ 1.409072] resctrl: L3 monitoring detected
+# dmesg: [ 1.411733] resctrl: L3 allocation detected
+# dmesg: [ 1.411792] resctrl: MB allocation detected
+# dmesg: [ 1.411831] resctrl: L3 monitoring detected
1..6
# Starting MBM test ...
# Mounting resctrl to "/sys/fs/resctrl"
# Writing benchmark parameters to resctrl FS
-# Benchmark PID: 5126
+# Benchmark PID: 4954
# Write schema "MB:0=100" to resctrl FS
# Checking for pass/fail
# Pass: Check MBM diff within 8%
# avg_diff_per: 0%
# Span (MB): 250
-# avg_bw_imc: 6422
-# avg_bw_resc: 6392
+# avg_bw_imc: 6886
+# avg_bw_resc: 6943
ok 1 MBM: test
# Starting MBA test ...
# Mounting resctrl to "/sys/fs/resctrl"
# Writing benchmark parameters to resctrl FS
-# Benchmark PID: 5129
+# Benchmark PID: 4957
# Write schema "MB:0=10" to resctrl FS
# Write schema "MB:0=20" to resctrl FS
# Write schema "MB:0=30" to resctrl FS
# Write schema "MB:0=40" to resctrl FS
# Write schema "MB:0=50" to resctrl FS
# Write schema "MB:0=60" to resctrl FS
# Write schema "MB:0=70" to resctrl FS
# Write schema "MB:0=80" to resctrl FS
# Write schema "MB:0=90" to resctrl FS
# Write schema "MB:0=100" to resctrl FS
# Results are displayed in (MB)
# Pass: Check MBA diff within 8% for schemata 10
-# avg_diff_per: 1%
-# avg_bw_imc: 2033
-# avg_bw_resc: 2012
+# avg_diff_per: 0%
+# avg_bw_imc: 2028
+# avg_bw_resc: 2032
# Pass: Check MBA diff within 8% for schemata 20
# avg_diff_per: 0%
-# avg_bw_imc: 3028
-# avg_bw_resc: 3005
+# avg_bw_imc: 3006
+# avg_bw_resc: 3011
# Pass: Check MBA diff within 8% for schemata 30
# avg_diff_per: 0%
-# avg_bw_imc: 3982
-# avg_bw_resc: 3958
+# avg_bw_imc: 4006
+# avg_bw_resc: 4013
# Pass: Check MBA diff within 8% for schemata 40
# avg_diff_per: 0%
-# avg_bw_imc: 6265
-# avg_bw_resc: 6236
+# avg_bw_imc: 6726
+# avg_bw_resc: 6732
# Pass: Check MBA diff within 8% for schemata 50
# avg_diff_per: 0%
-# avg_bw_imc: 6384
-# avg_bw_resc: 6355
+# avg_bw_imc: 6854
+# avg_bw_resc: 6856
# Pass: Check MBA diff within 8% for schemata 60
# avg_diff_per: 0%
-# avg_bw_imc: 6405
-# avg_bw_resc: 6376
+# avg_bw_imc: 6882
+# avg_bw_resc: 6883
# Pass: Check MBA diff within 8% for schemata 70
# avg_diff_per: 0%
-# avg_bw_imc: 6417
-# avg_bw_resc: 6387
+# avg_bw_imc: 6891
+# avg_bw_resc: 6889
# Pass: Check MBA diff within 8% for schemata 80
# avg_diff_per: 0%
-# avg_bw_imc: 6418
-# avg_bw_resc: 6394
+# avg_bw_imc: 6893
+# avg_bw_resc: 6909
# Pass: Check MBA diff within 8% for schemata 90
# avg_diff_per: 0%
-# avg_bw_imc: 6412
-# avg_bw_resc: 6384
+# avg_bw_imc: 6890
+# avg_bw_resc: 6888
# Pass: Check MBA diff within 8% for schemata 100
# avg_diff_per: 0%
-# avg_bw_imc: 6425
-# avg_bw_resc: 6399
+# avg_bw_imc: 6929
+# avg_bw_resc: 6951
# Pass: Check schemata change using MBA
ok 2 MBA: test
# Starting CMT test ...
# Mounting resctrl to "/sys/fs/resctrl"
# Cache size :23068672
# Writing benchmark parameters to resctrl FS
-# Benchmark PID: 5135
+# Benchmark PID: 4970
# Checking for pass/fail
-# Fail: Check cache miss rate within 15%
-# Percent diff=24
+# Pass: Check cache miss rate within 15%
+# Percent diff=4
# Number of bits: 5
-# Average LLC val: 7942963
+# Average LLC val: 10918297
# Cache span (bytes): 10485760
-not ok 3 CMT: test
+ok 3 CMT: test
# Starting L3_CAT test ...
# Mounting resctrl to "/sys/fs/resctrl"
# Cache size :23068672
# Writing benchmark parameters to resctrl FS
# Write schema "L3:0=1f0" to resctrl FS
# Write schema "L3:0=f" to resctrl FS
# Write schema "L3:0=1f8" to resctrl FS
# Write schema "L3:0=7" to resctrl FS
# Write schema "L3:0=1fc" to resctrl FS
# Write schema "L3:0=3" to resctrl FS
# Write schema "L3:0=1fe" to resctrl FS
# Write schema "L3:0=1" to resctrl FS
# Checking for pass/fail
# Number of bits: 4
-# Average LLC val: 71434
+# Average LLC val: 70161
# Cache span (lines): 131072
# Pass: Check cache miss rate changed more than 2.0%
-# Percent diff=70.0
+# Percent diff=72.1
# Number of bits: 3
-# Average LLC val: 121463
+# Average LLC val: 120755
# Cache span (lines): 98304
# Pass: Check cache miss rate changed more than 1.0%
-# Percent diff=40.8
+# Percent diff=42.5
# Number of bits: 2
-# Average LLC val: 170978
+# Average LLC val: 172077
# Cache span (lines): 65536
# Pass: Check cache miss rate changed more than 0.0%
-# Percent diff=22.8
+# Percent diff=22.0
# Number of bits: 1
-# Average LLC val: 209950
+# Average LLC val: 209893
# Cache span (lines): 32768
ok 4 L3_CAT: test
# Starting L3_NONCONT_CAT test ...
# Mounting resctrl to "/sys/fs/resctrl"
# Write schema "L3:0=3f" to resctrl FS
# Write schema "L3:0=787" to resctrl FS # write() failed : Invalid argument
# Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
ok 5 L3_NONCONT_CAT: test
# Starting L2_NONCONT_CAT test ...
# Mounting resctrl to "/sys/fs/resctrl"
ok 6 # SKIP Hardware does not support L2_NONCONT_CAT or L2_NONCONT_CAT is disabled
# 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
-# Totals: pass:4 fail:1 xfail:0 xpass:0 skip:1 error:0
+# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:1 error:0
---
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-14 22:55 ` Reinette Chatre
@ 2025-10-15 15:47 ` Dave Martin
2025-10-15 18:48 ` Luck, Tony
2025-10-16 16:31 ` Reinette Chatre
0 siblings, 2 replies; 52+ messages in thread
From: Dave Martin @ 2025-10-15 15:47 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 10/13/25 7:36 AM, Dave Martin wrote:
> > Hi Reinette,
> >
> > On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> On 9/30/25 8:40 AM, Dave Martin wrote:
> >>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
> >>>> On 9/29/25 6:56 AM, Dave Martin wrote:
> >
> > [...]
> >
> >>>> 1) Commented schema are "inactive"
> >>>> This is unclear to me. In the MB example the commented lines show the
> >>>> finer grained controls. Since the original MB resource is an approximation
> >>>> and the hardware must already be configured to support it, would the #-prefixed
> >>>> lines not show the actual "active" configuration?
> >>>
> >>> They would show the active configuration (possibly more precisely than
> >>> "MB" does).
> >>
> >> That is how I see it also. This is specific to MB as we try to maintain
> >> backward compatibility.
> >>
> >> If we are going to make user interface changes to resource allocation then
> >> ideally it should consider all known future usage. I am trying to navigate
> >> and understand the discussion on how resctrl can support MPAM and this
> >> RDT region aware requirements.
> >>
> >> I scanned the MPAM spec and from what I understand a resource may support
> >> multiple controls at the same time, each with its own properties, and then
> >> there was this:
> >>
> >> When multiple partitioning controls are active, each affects the partition’s
> >> bandwidth usage. However, some combinations of controls may not make sense,
> >> because the regulation of that pair of controls cannot be made to work in concert.
> >>
> >> resctrl may thus present an "active configuration" that is not a configuration
> >> that "makes sense" ... this may be ok as resctrl would present what hardware
> >> supports combined with what user requested.
> >
> > This is analogous to what the MPAM spec says, though if resctrl offers
> > two different schemata for the same hardware control, the control cannot be
> > configured with both values simultaneously.
> >
> > For the MPAM hardware controls affecting the same hardware resource,
> > they can be programmed to combinations of values that have no sensible
> > interpretation, and the values can be read back just fine. The
> > performance effects may not be what the user expected / wanted, but
> > this is not directly visible to resctrl.
> >
> > So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user
> > can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that
> > will read back just as programmed. The architecture does not promise
> > what the performance effect of this will be, but resctrl does not need
> > to care.
>
> The same appears to be true for Intel RDT where the spec warns ("Undesirable
> and undefined performance effects may result if cap programming guidelines
> are not followed.") but does not seem to prevent such configurations.
Right.  We _could_ block such a configuration from reaching the hardware,
if the arch backend overrides the MIN limit when the MAX limit is
written and vice-versa, when not doing so would result in crossed-over
bounds.
If software wants to program both bounds, then that would be fine.  In:
# cat <<-EOF >/sys/fs/resctrl/schemata
MB_MAX: 0=128
EOF
# cat <<-EOF >/sys/fs/resctrl/schemata
MB_MIN: 0=256
MB_MAX: 0=1024
EOF
... internally programming some value >=256 before programming the
hardware with the new min bound would not stop the final requested
change to MB_MAX from working as userspace expected.
(There will be inevitable regulation glitches unless the hardware
provides a way to program both bounds atomically. MPAM doesn't; I
don't think RDT does either?)
But we only _need_ to do this if the hardware architecture forbids
programming crossed bounds or says that it is unsafe to do so.  So, I am
thinking that the generic code doesn't need to handle this.
[...]
> >> To be specific, the original proposal [1] introduced a set of files for
> >> a numeric control and that seems to work for existing and upcoming
> >> schema that need a value in a range. Different controls need different
> >> parameters so to integrate this solution I think it needs another parameter
> >> (presented as a directory, a file, or within a file) that indicates the
> >> type of the control so that user space knows which files/parameters to expect
> >> and how to interpret them.
> >
> > Agreed. I wasn't meaning to imply that this proposal shouldn't be
> > integrated into something more general. If we want a richer
> > description than the current one, it makes sense to incorporate bitmap
> > controls -- this just wasn't my focus.
>
> Understood.
>
> >
> >> Since different controls have different parameters we need to consider
> >> whether it is easier to create/parse unique files for each control or
> >> present all the parameters within one file with another file noting the type
> >> of control.
> >
> > Separate files works quite well for low-tech tooling built using shell
> > scripts, and this seems to follow the sysfs philosophy. Since there is
> > no need to keep re-reading these parameters, simplicity feels more
> > important than efficiency?
> >
> > But we could equally have a single file with multiple pieces of
> > information in it.
> >
> > I don't have a strong view on this.
>
> If by sysfs philosophy you mean "one value per file" then resctrl split from that from
> the beginning (with the schemata file). I am also not advocating for one or the other
> at this time but believe we have some flexibility when faced with implementation
> options/challenges.
Agreed -- it works either way.
[...]
> >> At this time I am envisioning the proposal to result in something like below where
> >> there is one resource directory and one directory per schema entry with a (added by me)
> >> "schema_type" file to help user find out what the schema type is to know which files are present:
> >>
> >> MB
> >> ├── bandwidth_gran
> >> ├── delay_linear
> >> ├── MB
> >> │ ├── map
> >> │ ├── max
> >> │ ├── min
> >> │ ├── scale
> >> │ ├── schema_type
> >> │ └── unit
> >> ├── MB_HW
> >> │ ├── map
[...]
> >> ├── min_bandwidth
> >> ├── num_closids
> >> └── thread_throttle_mode
> >
> > I see no reason not to do that. Either way, older userspace just
> > ignores the new files and directories.
> >
> > Perhaps add an intermediate subdirectory to clarify the relationship
> > between the resource dir and the individual schema descriptions?
> >
> > This may also avoid the new descriptions getting mixed up with the old
> > description files.
> >
> > Say,
> >
> > info
> > ├── MB
> > │ ├── resource_schemata
> > │ │ ├── MB
> > │ │ │ ├── map
> > │ │ │ ├── max
> > │ ┆ │ ├── min
> > │ │ ┆
> > ┆ │
> > ├── MB_HW
> > │ ├── map
> > │ ┆
> > ┆
>
> Looks good to me.
OK
> >> Something else related to control that caught my eye in MPAM spec is this gem:
> >> MPAM provides discoverable vendor extensions to permit partners
> >> to invent partitioning controls.
> >
> > Yup.
> >
> > Since we have no way to know what vendor-specific controls look like or
> > what they mean, we can't do much about this.
> >
> > So, it's the vendor's job to implement support for it, and we might
> > still say no (if there is no sane way to integrate it).
>
> ack.
>
> >
> >>> MB may be hard to describe in a useful way, though -- at least in the
> >>> MPAM case, where the number of steps does not divide into 100, and the
> >>> AMD cases where the meaning of the MB control values is different.
> >>
> >> Above I do assume that MB would be represented in a new interface since it
> >> is a schema entry, if that causes trouble then we could drop it.
> >
> > Since MB is described by the existing files and the documentation,
> > perhaps it doesn't need an additional description.
> >
> > Alternatively though, could we just have a special schema_type for this,
> > and omit the other properties? This would mean that we at least have
> > an entry for every schema.
>
> We could do this, yes.
I guess I'll go with this approach, then, and see if anyone objects.
[...]
> >>> MB: 0=50, 1=50
> >>> # MB_HW: 0=32, 1=32
> >>> # MB_MIN: 0=16, 1=16
> >>> # MB_MAX: 0=32, 1=32
> >>
> >> Could/should resctrl uncomment the lines after userspace modified them?
> >
> > The '#' wasn't meant to be a state that gets turned on and off.
>
> Thank you for clarifying.
>
> > Rather, userspace would use this to indicate which entries are
> > intentionally being modified.
>
> I see. I assume that we should not see many of these '#' entries and expect
> the ones we do see to shadow the legacy schemata entries. New schemata entries
> (that do not shadow legacy ones) should not have the '#' prefix even if
> their initial support does not include all controls.
> > So long as the entries affecting a single resource are ordered so that
> > each entry is strictly more specific than the previous entries (as
> > illustrated above), then reading schemata and stripping all the hashes
> > would allow a previous configuration to be restored; to change just one
> > entry, userspace can uncomment just that one, or write only that entry
> > (which is what I think we should recommend for new software).
>
> This is a good rule of thumb.
To avoid printing entries in the wrong order, do we want to track some
parent/child relationship between schemata?
In the above example,
* MB is the parent of MB_HW;
* MB_HW is the parent of MB_MIN and MB_MAX.
(for MPAM, at least).
When schemata is read, parents should always be printed before their
child schemata. But really, we just need to make sure that the
rdt_schema_all list is correctly ordered.
Do you think that this relationship needs to be reported to userspace?
Since the "#" convention is for backward compatibility, maybe we should
not use this for new schemata, and place the burden of managing
conflicts onto userspace going forward. What do you think?
> >>> (For hardware-specific reasons, the MPAM driver currently internally
> >>> programs the MIN bound to be a bit less than the MAX bound, when
> >>> userspace writes an "MB" entry into schemata. The key thing is that
> >>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
> >>> resctrl level, I don't think that we necessarily need to make promises
> >>> about what they can change _to_. The exact effect of MIN and MAX
> >>> bounds is likely to be hardware-dependent anyway.)
> >>
> >> MPAM has the "HARDLIM" distinction associated with these MAX values
> >> and from what I can tell this is per PARTID. Is this something that needs
> >> to be supported? To do this resctrl will need to support modifying
> >> control properties per resource group.
> >
> > Possibly. Since this is a boolean control that determines how the
> > MBW_MAX control is applied, we could perhaps present it as an
> > additional schema -- if so, it's basically orthogonal.
> >
> > | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
> >
> > or
> >
> > | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
> >
> > Does this look reasonable?
>
> It does.
OK -- note, I don't think we have any immediate plan to support this in
the MPAM driver, but it may land eventually in some form.
[...]
> >>> Regarding new userspace:
> >>>
> >>> Going forward, we can explicitly document that there should be no
> >>> conflicting or "passenger" entries in a schemata write: don't include
> >>> an entry for something that you don't explicitly want to set, and if
> >>> multiple entries affect the same resource, we don't promise what
> >>> happens.
> >>>
> >>> (But sadly, we can't impose that rule on existing software after the
> >>> fact.)
> >>
> >> It may thus not be worth it to make such a rule.
> >
> > Ack. Perhaps we could recommend it, though.
>
> We could, yes.
OK
[...]
> >> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual,
> >> ... ? Is it expected that MPAM's support of these should be exposed via resctrl?
> >
> > Probably not. These are best regarded as entirely separate instances
> > of MPAM; the PARTID spaces are separate. The Non-secure physical
> > address space is the only physical address space directly accessible to
> > Linux -- for the others, we can't address the MMIO registers anyway.
> >
> > For now, the other address spaces are the firmware's problem.
>
> Thank you.
No worries -- it's not too obvious from the spec!
> >> Have you considered how to express it if the user wants hardware to have different
> >> allocations for, for example, the same PARTID at different execution levels?
> >>
> >> Reinette
> >>
> >> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
> >
> > MPAM doesn't allow different controls for a PARTID depending on the
> > exception level, but it is possible to program different PARTIDs for
> > hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0).
>
> I misunderstood this from the spec. Thank you for clarifying.
>
> >
> > I think that if we wanted to go down that road, we would want to expose
> > additional "task IDs" in resctrlfs that can be placed into groups
> > independently, say
> >
> > echo 14161:kernel >>.../some_group/tasks
> > echo 14161:user >>.../other_group/tasks
> >
> > However, inside the kernel, the boundary between work done on behalf of
> > a specific userspace task, work done on behalf of userspace in general,
> > and autonomous work inside the kernel is fuzzy and not well defined.
> >
> > For this reason, we currently only configure the PARTID for EL0. For
> > EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0).
> >
> > Hopefully this is orthogonal to the discussion of schema descriptions,
> > though ...?
>
> Yes.
OK; I suggest that we put this on one side, for now, then.
There is a discussion to be had on this, but it feels like a separate
thing.
I'll try to pull the state of this discussion together -- maybe as a
draft update to the documentation, describing the interface as proposed
so far. Does that work for you?
Cheers
--Dave
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-15 15:47 ` Dave Martin
@ 2025-10-15 18:48 ` Luck, Tony
2025-10-16 14:50 ` Dave Martin
2025-10-16 16:31 ` Reinette Chatre
1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2025-10-15 18:48 UTC (permalink / raw)
To: Dave Martin, Chatre, Reinette
Cc: linux-kernel@vger.kernel.org, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86@kernel.org, linux-doc@vger.kernel.org
> I'll try to pull the state of this discussion together -- maybe as a
> draft update to the documentation, describing the interface as proposed
> so far. Does that work for you?
Dave,
Yes please. This discussion has explored a bunch of options. A summary
would be perfect.
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-15 18:48 ` Luck, Tony
@ 2025-10-16 14:50 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-10-16 14:50 UTC (permalink / raw)
To: Luck, Tony
Cc: Chatre, Reinette, linux-kernel@vger.kernel.org, James Morse,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, x86@kernel.org,
linux-doc@vger.kernel.org
Hi Tony,
On Wed, Oct 15, 2025 at 06:48:55PM +0000, Luck, Tony wrote:
> > I'll try to pull the state of this discussion together -- maybe as a
> > draft update to the documentation, describing the interface as proposed
> > so far. Does that work for you?
>
> Dave,
>
> Yes please. This discussion has explored a bunch of options. A summary
> would be perfect.
OK -- next week sometime, probably.
I'll try to keep it high-level, since we don't have a final design yet...
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-15 15:18 ` Dave Martin
@ 2025-10-16 15:57 ` Reinette Chatre
2025-10-17 15:52 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-10-16 15:57 UTC (permalink / raw)
To: Dave Martin
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 10/15/25 8:18 AM, Dave Martin wrote:
> Hi Reinette,
>
> Just following up on the skipped L2_NONCONT_CAT test -- see below.
Thank you very much.
>
> [...]
>
> On Mon, Sep 22, 2025 at 03:39:47PM +0100, Dave Martin wrote:
>
> [...]
>
>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
>
> [...]
>
>>> On 9/2/25 9:24 AM, Dave Martin wrote:
>
> [...]
>
>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
>>>> the other tests except for the NONCONT_CAT tests, which do not seem to
>>>> be supported in my configuration -- and have nothing to do with the
>>>> code touched by this patch).
>>>
>>> Is the NONCONT_CAT test failing (i.e. printing "not ok")?
>>>
>>> The NONCONT_CAT tests may print error messages as debug information as part of
>>> running, but these errors are expected as part of the test. The test should accurately
>>> state whether it passed or failed though. For example, below attempts to write
>>> a non-contiguous CBM to a system that does not support non-contiguous masks.
>>> This fails as expected, the error messages are printed as debugging output, and
>>> thus the test passes with an "ok".
>>>
>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument
>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
>>> ok 5 L3_NONCONT_CAT: test
>>
>> I don't think that this was anything to do with my changes, but I no
>> longer seem to have the test output.  (Since this test has to do with
>> bitmap schemata (?), it seemed unlikely to be affected by changes to
>> bw_validate().)
>>
>> I'll need to re-test with and without this patch to check whether it
>> makes any difference.
>
> I finally got around to testing this on top of -rc1.
>
> Disregarding trivial differences, the patched version (+++) doesn't
> seem to introduce any regressions over the vanilla version (---)
> (below). (The CMT test actually failed with an out-of-tolerance result
> on the vanilla kernel only. Possibly there was some adverse system
> load interfering.)
My first thought is that this is another unfortunate consequence of the resctrl
performance-as-functional tests.
The percentage difference you encountered is quite large, which prompted me to take
a closer look, and it does look to me as though the CMT test can be improved.
(Whether we should spend more effort on these performance tests instead of creating
new deterministic functional tests is another topic.)
>
>
> Looking at the code, it seems that L2_NONCONT_CAT is not gated by any
> config or mount option. I think this is just a feature that my
> hardware doesn't support (?)
Yes, this is how I also interpret the test output.
Focusing on the CMT test ...
> # Starting CMT test ...
> # Mounting resctrl to "/sys/fs/resctrl"
> # Cache size :23068672
> # Writing benchmark parameters to resctrl FS
> -# Benchmark PID: 5135
> +# Benchmark PID: 4970
> # Checking for pass/fail
> -# Fail: Check cache miss rate within 15%
> -# Percent diff=24
> +# Pass: Check cache miss rate within 15%
> +# Percent diff=4
> # Number of bits: 5
> -# Average LLC val: 7942963
> +# Average LLC val: 10918297
> # Cache span (bytes): 10485760
> -not ok 3 CMT: test
> +ok 3 CMT: test
A 24% difference followed by a 4% difference is a big swing. On a high level
the CMT test creates a new resource group with only the test assigned to it. The test
initializes and accesses a buffer a couple of times while measuring cache occupancy.
"success" is when the cache occupancy is within 15% of the buffer size.
I noticed a couple of places where the test is susceptible to interference and
system architecture.
1) The cache allocation of test's resource group overlaps with the rest of the
system. On a busy system it is thus likely that the test's cache entries may be
evicted.
2) The test does not account for cache architecture where, for example, there may be
an L2 cache that can accommodate a large part of the buffer and thus not be
reflected in the LLC occupancy count.
I started experimenting to see what it would take to reduce interference and ended up
with a change like the one below, which isolates the cache portions between the test
and the rest of the system and, if L2 cache allocation is possible, reduces the amount
of L2 cache the test can allocate into as much as possible. This opened up another
tangent: the size of the cache portion is the same as the buffer, while it is not
realistic to expect a user-space buffer to fill the cache so nicely.
Even with these changes I was not able to get the percentages to drop significantly
on my system, but they may help to reduce the swings in the numbers observed.
But I do not see how work like this helps to improve resctrl health (compared to,
for example, just increasing the "success" percentage).
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
index d09e693dc739..494e98aa8b69 100644
--- a/tools/testing/selftests/resctrl/cmt_test.c
+++ b/tools/testing/selftests/resctrl/cmt_test.c
@@ -19,12 +19,22 @@
#define CON_MON_LCC_OCCUP_PATH \
"%s/%s/mon_data/mon_L3_%02d/llc_occupancy"
-static int cmt_init(const struct resctrl_val_param *param, int domain_id)
+static int cmt_init(const struct resctrl_test *test,
+ const struct user_params *uparams,
+ const struct resctrl_val_param *param, int domain_id)
{
+ char schemata[64];
+ int ret;
+
sprintf(llc_occup_path, CON_MON_LCC_OCCUP_PATH, RESCTRL_PATH,
param->ctrlgrp, domain_id);
- return 0;
+ snprintf(schemata, sizeof(schemata), "%lx", param->mask);
+ ret = write_schemata(param->ctrlgrp, schemata, uparams->cpu, test->resource);
+ if (!ret && !strcmp(test->resource, "L3") && resctrl_resource_exists("L2"))
+ ret = write_schemata(param->ctrlgrp, "0x1", uparams->cpu, "L2");
+
+ return ret;
}
static int cmt_setup(const struct resctrl_test *test,
@@ -119,6 +129,7 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
unsigned long cache_total_size = 0;
int n = uparams->bits ? : 5;
unsigned long long_mask;
+ char schemata[64];
int count_of_bits;
size_t span;
int ret;
@@ -162,6 +173,11 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
param.fill_buf = &fill_buf;
}
+ snprintf(schemata, sizeof(schemata), "%lx", ~param.mask & long_mask);
+ ret = write_schemata("", schemata, uparams->cpu, test->resource);
+ if (ret)
+ return ret;
+
remove(RESULT_FILE_NAME);
ret = resctrl_val(test, uparams, &param);
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index c7e9adc0368f..cd4c715b7ffd 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -17,7 +17,9 @@
#define ALLOCATION_MIN 10
#define ALLOCATION_STEP 10
-static int mba_init(const struct resctrl_val_param *param, int domain_id)
+static int mba_init(const struct resctrl_test *test,
+ const struct user_params *uparams,
+ const struct resctrl_val_param *param, int domain_id)
{
int ret;
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index 84d8bc250539..58201f844740 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -83,7 +83,9 @@ static int check_results(size_t span)
return ret;
}
-static int mbm_init(const struct resctrl_val_param *param, int domain_id)
+static int mbm_init(const struct resctrl_test *test,
+ const struct user_params *uparams,
+ const struct resctrl_val_param *param, int domain_id)
{
int ret;
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index cd3adfc14969..9853bd746392 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -133,7 +133,9 @@ struct resctrl_val_param {
char filename[64];
unsigned long mask;
int num_of_runs;
- int (*init)(const struct resctrl_val_param *param,
+ int (*init)(const struct resctrl_test *test,
+ const struct user_params *uparams,
+ const struct resctrl_val_param *param,
int domain_id);
int (*setup)(const struct resctrl_test *test,
const struct user_params *uparams,
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index 7c08e936572d..a5a8badb83d4 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -569,7 +569,7 @@ int resctrl_val(const struct resctrl_test *test,
goto reset_affinity;
if (param->init) {
- ret = param->init(param, domain_id);
+ ret = param->init(test, uparams, param, domain_id);
if (ret)
goto reset_affinity;
}
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-15 15:47 ` Dave Martin
2025-10-15 18:48 ` Luck, Tony
@ 2025-10-16 16:31 ` Reinette Chatre
2025-10-17 14:17 ` Dave Martin
1 sibling, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-10-16 16:31 UTC (permalink / raw)
To: Dave Martin
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 10/15/25 8:47 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/13/25 7:36 AM, Dave Martin wrote:
>>> Hi Reinette,
>>>
>>> On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 9/30/25 8:40 AM, Dave Martin wrote:
>>>>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote:
>>>>>> On 9/29/25 6:56 AM, Dave Martin wrote:
>>>
>>> [...]
>>>
>>>>>> 1) Commented schema are "inactive"
>>>>>> This is unclear to me. In the MB example the commented lines show the
>>>>>> finer grained controls. Since the original MB resource is an approximation
>>>>>> and the hardware must already be configured to support it, would the #-prefixed
>>>>>> lines not show the actual "active" configuration?
>>>>>
>>>>> They would show the active configuration (possibly more precisely than
>>>>> "MB" does).
>>>>
>>>> That is how I see it also. This is specific to MB as we try to maintain
>>>> backward compatibility.
>>>>
>>>> If we are going to make user interface changes to resource allocation then
>>>> ideally it should consider all known future usage. I am trying to navigate
>>>> and understand the discussion on how resctrl can support MPAM and this
>>>> RDT region aware requirements.
>>>>
>>>> I scanned the MPAM spec and from what I understand a resource may support
>>>> multiple controls at the same time, each with its own properties, and then
>>>> there was this:
>>>>
>>>> When multiple partitioning controls are active, each affects the partition’s
>>>> bandwidth usage. However, some combinations of controls may not make sense,
>>>> because the regulation of that pair of controls cannot be made to work in concert.
>>>>
>>>> resctrl may thus present an "active configuration" that is not a configuration
>>>> that "makes sense" ... this may be ok as resctrl would present what hardware
>>>> supports combined with what user requested.
>>>
>>> This is analogous to what the MPAM spec says, though if resctrl offers
>>> two different schemata for the same hardware control, the control cannot be
>>> configured with both values simultaneously.
>>>
>>> For the MPAM hardware controls affecting the same hardware resource,
>>> they can be programmed to combinations of values that have no sensible
>>> interpretation, and the values can be read back just fine. The
>>> performance effects may not be what the user expected / wanted, but
>>> this is not directly visible to resctrl.
>>>
>>> So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user
>>> can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that
>>> will read back just as programmed. The architecture does not promise
>>> what the performance effect of this will be, but resctrl does not need
>>> to care.
>>
>> The same appears to be true for Intel RDT where the spec warns ("Undesirable
>> and undefined performance effects may result if cap programming guidelines
>> are not followed.") but does not seem to prevent such configurations.
>
> Right. We _could_ block such a configuration from reaching the hardware,
> if the arch backend overrides the MIN limit when the MAX limit is
> written and vice-versa, when not doing so would result in crossed-over
> bounds.
>
> If software wants to program both bounds, then that would be fine: in:
>
> # cat <<-EOF >/sys/fs/resctrl/schemata
> MB_MAX: 0=128
> EOF
>
> # cat <<-EOF >/sys/fs/resctrl/schemata
> MB_MIN: 0=256
> MB_MAX: 0=1024
> EOF
>
> ... internally programming some value >=256 before programming the
> hardware with the new min bound would not stop the final requested
> change to MB_MAX from working as userspace expected.
>
> (There will be inevitable regulation glitches unless the hardware
> provides a way to program both bounds atomically. MPAM doesn't; I
> don't think RDT does either?)
>
>
> But we only _need_ to do this if the hardware architecture forbids
> programming crossed bounds or says that it is unsafe to do so. So, I am
> thinking that the generic code doesn't need to handle this.
>
> [...]
Sounds reasonable to me.
...
>>>>> MB: 0=50, 1=50
>>>>> # MB_HW: 0=32, 1=32
>>>>> # MB_MIN: 0=16, 1=16
>>>>> # MB_MAX: 0=32, 1=32
>>>>
>>>> Could/should resctrl uncomment the lines after userspace modified them?
>>>
>>> The '#' wasn't meant to be a state that gets turned on and off.
>>
>> Thank you for clarifying.
>>
>>> Rather, userspace would use this to indicate which entries are
>>> intentionally being modified.
>>
>> I see. I assume that we should not see many of these '#' entries and expect
>> the ones we do see to shadow the legacy schemata entries. New schemata entries
>> (that do not shadow legacy ones) should not have the '#' prefix even if
>> their initial support does not include all controls.
>>> So long as the entries affecting a single resource are ordered so that
>>> each entry is strictly more specific than the previous entries (as
>>> illustrated above), then reading schemata and stripping all the hashes
>>> would allow a previous configuration to be restored; to change just one
>>> entry, userspace can uncomment just that one, or write only that entry
>>> (which is what I think we should recommend for new software).
>>
>> This is a good rule of thumb.
>
> To avoid printing entries in the wrong order, do we want to track some
> parent/child relationship between schemata?
>
> In the above example,
>
> * MB is the parent of MB_HW;
>
> * MB_HW is the parent of MB_MIN and MB_MAX.
>
> (for MPAM, at least).
Could you please elaborate this relationship? I envisioned the MB_HW to be
something similar to Intel RDT's "optimal" bandwidth setting ... something
that is expected to be somewhere between the "min" and the "max".
But, now I think I'm a bit lost in MPAM since it is not clear to me what
MB_HW represents ... would this be the "memory bandwidth portion
partitioning"? Although, that uses a completely different format from
"min" and "max".
>
> When schemata is read, parents should always be printed before their
> child schemata. But really, we just need to make sure that the
> rdt_schema_all list is correctly ordered.
>
>
> Do you think that this relationship needs to be reported to userspace?
You brought up the topic of relationships in
https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
to learn more from the MPAM spec where I learned and went on a tangent about all
the other possible namespaces without circling back.
I was hoping that the namespace prefix would make the relationships clear,
something like <resource>_<control>, but I did not expect another layer in
the hierarchy like your example above. The idea of "parent" and "child" is
also not obvious to me at this point. resctrl gives us a "resource" to start
with and we are now discussing multiple controls per resource. Could you please
elaborate what you see as "parent" and "child"?
We do have the info directory available to express relationships and a
hierarchy is already starting to take shape there.
>
> Since the "#" convention is for backward compatibility, maybe we should
> not use this for new schemata, and place the burden of managing
> conflicts onto userspace going forward. What do you think?
I agree. The way I understand this is that the '#' will only be used for
new controls that shadow the default/current controls of the legacy resources.
I do not expect that the prefix will be needed for new resources, even if
the initial support of a new resource does not include all possible controls.
>>>>> (For hardware-specific reasons, the MPAM driver currently internally
>>>>> programs the MIN bound to be a bit less than the MAX bound, when
>>>>> userspace writes an "MB" entry into schemata. The key thing is that
>>>>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the
>>>>> resctrl level, I don't think that we necessarily need to make promises
>>>>> about what they can change _to_. The exact effect of MIN and MAX
>>>>> bounds is likely to be hardware-dependent anyway.)
>>>>
>>>> MPAM has the "HARDLIM" distinction associated with these MAX values
>>>> and from what I can tell this is per PARTID. Is this something that needs
>>>> to be supported? To do this resctrl will need to support modifying
>>>> control properties per resource group.
>>>
>>> Possibly. Since this is a boolean control that determines how the
>>> MBW_MAX control is applied, we could perhaps present it as an
>>> additional schema -- if so, it's basically orthogonal.
>>>
>>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>>>
>>> or
>>>
>>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>>>
>>> Does this look reasonable?
>>
>> It does.
>
> OK -- note, I don't think we have any immediate plan to support this in
> the MPAM driver, but it may land eventually in some form.
>
ack.
...
>>>> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual,
>>>> ... ? Is it expected that MPAM's support of these should be exposed via resctrl?
>>>
>>> Probably not. These are best regarded as entirely separate instances
>>> of MPAM; the PARTID spaces are separate. The Non-secure physical
>>> address space is the only physical address space directly accessible to
>>> Linux -- for the others, we can't address the MMIO registers anyway.
>>>
>>> For now, the other address spaces are the firmware's problem.
>>
>> Thank you.
>
> No worries -- it's not too obvious from the spec!
>
>>>> Have you considered how to express if user wants hardware to have different
>>>> allocations for, for example, same PARTID at different execution levels?
>>>>
>>>> Reinette
>>>>
>>>> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/
>>>
>>> MPAM doesn't allow different controls for a PARTID depending on the
>>> exception level, but it is possible to program different PARTIDs for
>>> hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0).
>>
>> I misunderstood this from the spec. Thank you for clarifying.
>>
>>>
>>> I think that if we wanted to go down that road, we would want to expose
>>> additional "task IDs" in resctrlfs that can be placed into groups
>>> independently, say
>>>
>>> echo 14161:kernel >>.../some_group/tasks
>>> echo 14161:user >>.../other_group/tasks
>>>
>>> However, inside the kernel, the boundary between work done on behalf of
>>> a specific userspace task, work done on behalf of userspace in general,
>>> and autonomous work inside the kernel is fuzzy and not well defined.
>>>
>>> For this reason, we currently only configure the PARTID for EL0. For
>>> EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0).
>>>
>>> Hopefully this is orthogonal to the discussion of schema descriptions,
>>> though ...?
>>
>> Yes.
>
> OK; I suggest that we put this on one side, for now, then.
>
> There is a discussion to be had on this, but it feels like a separate
> thing.
agreed.
>
>
> I'll try to pull the state of this discussion together -- maybe as a
> draft update to the documentation, describing the interface as proposed
> so far. Does that work for you?
It does. Thank you very much for taking this on.
Reinette
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-16 16:31 ` Reinette Chatre
@ 2025-10-17 14:17 ` Dave Martin
2025-10-17 15:59 ` Reinette Chatre
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-10-17 14:17 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 10/15/25 8:47 AM, Dave Martin wrote:
> > Hi Reinette,
> >
> > On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> On 10/13/25 7:36 AM, Dave Martin wrote:
[...]
> >>> [...] if we offer independent schemata for MBW_MIN and MBW_MAX, the user
> >>> can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that
> >>> will read back just as programmed. The architecture does not promise
> >>> what the performance effect of this will be, but resctrl does not need
> >>> to care.
> >>
> >> The same appears to be true for Intel RDT where the spec warns ("Undesirable
> >> and undefined performance effects may result if cap programming guidelines
> >> are not followed.") but does not seem to prevent such configurations.
> >
> > Right. We _could_ block such a configuration from reaching the hardware,
> > if the arch backend overrides the MIN limit when the MAX limit is
> > written and vice-versa, when not doing so would result in crossed-over
> > bounds.
> >
> > If software wants to program both bounds, then that would be fine: in:
> >
> > # cat <<-EOF >/sys/fs/resctrl/schemata
> > MB_MAX: 0=128
> > EOF
> >
> > # cat <<-EOF >/sys/fs/resctrl/schemata
> > MB_MIN: 0=256
> > MB_MAX: 0=1024
> > EOF
> >
> > ... internally programming some value >=256 before programming the
> > hardware with the new min bound would not stop the final requested
> > change to MB_MAX from working as userspace expected.
> >
> > (There will be inevitable regulation glitches unless the hardware
> > provides a way to program both bounds atomically. MPAM doesn't; I
> > don't think RDT does either?)
> >
> >
> > But we only _need_ to do this if the hardware architecture forbids
> > programming crossed bounds or says that it is unsafe to do so. So, I am
> > thinking that the generic code doesn't need to handle this.
> >
> > [...]
>
> Sounds reasonable to me.
OK
[...]
> >>> So long as the entries affecting a single resource are ordered so that
> >>> each entry is strictly more specific than the previous entries (as
> >>> illustrated above), then reading schemata and stripping all the hashes
> >>> would allow a previous configuration to be restored; to change just one
> >>> entry, userspace can uncomment just that one, or write only that entry
> >>> (which is what I think we should recommend for new software).
> >>
> >> This is a good rule of thumb.
> >
> > To avoid printing entries in the wrong order, do we want to track some
> > parent/child relationship between schemata?
> >
> > In the above example,
> >
> > * MB is the parent of MB_HW;
> >
> > * MB_HW is the parent of MB_MIN and MB_MAX.
> >
> > (for MPAM, at least).
>
> Could you please elaborate this relationship? I envisioned the MB_HW to be
> something similar to Intel RDT's "optimal" bandwidth setting ... something
> that is expected to be somewhere between the "min" and the "max".
>
> But, now I think I'm a bit lost in MPAM since it is not clear to me what
> MB_HW represents ... would this be the "memory bandwidth portion
> partitioning"? Although, that uses a completely different format from
> "min" and "max".
I confess that I'm thinking with an MPAM mindset here.
Some pseudocode might help to illustrate how these might interact:
set_MB(partid, val) {
	set_MB_HW(partid, percent_to_hw_val(val));
}

set_MB_HW(partid, val) {
	set_MB_MAX(partid, val);

	/*
	 * Hysteresis to avoid steady flows from ping-ponging
	 * between low and high priority:
	 */
	if (hardware_has_MB_MIN())
		set_MB_MIN(partid, val * 95%);
}

set_MB_MIN(partid, val) {
	mpam->MBW_MIN[partid] = val;
}

set_MB_MAX(partid, val) {
	mpam->MBW_MAX[partid] = val;
}

with

get_MB(partid) {
	return hw_val_to_percent(get_MB_HW(partid));
}

get_MB_HW(partid) { return get_MB_MAX(partid); }
get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; }
get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
The parent/child relationship I suggested is basically the call-graph
of this pseudocode. These could all be exposed as resctrl schemata,
but the children provide finer / more broken-down control than the
parents. Reading a parent provides a merged or approximated view of
the configuration of the child schemata.
In particular,

	set_child(partid, get_child(partid));
	get_parent(partid);

yields the same result as

	get_parent(partid);

but this will not hold in general if the roles of parent and child are
reversed.
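Purely to illustrate that asymmetry, here is the pseudocode modelled as a
toy Python script. The HW_MAX denominator, the percentage conversion and
the 95% hysteresis factor are just the assumptions from the sketch above,
not the real driver's arithmetic:

```python
# Toy model of the set_MB/set_MB_HW/set_MB_MIN/set_MB_MAX pseudocode.
# All names and conversion factors are illustrative assumptions, not
# the real MPAM driver interface.

HW_MAX = 1024  # assumed hardware fraction denominator

state = {"MBW_MIN": 0, "MBW_MAX": HW_MAX}

def percent_to_hw(val):
    return val * HW_MAX // 100

def hw_to_percent(val):
    return val * 100 // HW_MAX

def set_mb_max(val): state["MBW_MAX"] = val
def set_mb_min(val): state["MBW_MIN"] = val

def set_mb_hw(val):
    set_mb_max(val)
    set_mb_min(val * 95 // 100)  # hysteresis, as in the sketch

def set_mb(percent): set_mb_hw(percent_to_hw(percent))

def get_mb_max(): return state["MBW_MAX"]
def get_mb_min(): return state["MBW_MIN"]
def get_mb_hw(): return get_mb_max()
def get_mb(): return hw_to_percent(get_mb_hw())

# Child round-trip leaves the parent's (merged) view unchanged:
set_mb(50)
before = get_mb()
set_mb_max(get_mb_max())
assert get_mb() == before

# But a parent round-trip can clobber fine-grained child detail:
set_mb_min(300)            # tweak via the child schema
set_mb(get_mb())           # parent round-trip re-derives MBW_MIN
assert state["MBW_MIN"] != 300
```

(The point being only that the child-to-parent direction is lossy, not
that these are the conversions any real hardware would use.)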
I think this still holds true if implementing an "MB_HW" schema for
newer revisions of RDT. The pseudocode would be different, but there
will still be a tree-like call graph (?)
Going back to MPAM:
Re MPAM memory bandwidth portion partitioning (a.k.a., MBW_PART or
MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory
bandwidth is split into discrete, non-overlapping chunks, and each
PARTID is configured with a bitmap saying which chunks it can use.
This could be done by time-slicing, or controlling which memory
controllers/ports a PARTID can issue requests to, or something like
that.
If the MBW_MAX control isn't implemented, then the current MPAM driver
maps this bitmap control onto the resctrl "MB" schema in a simple way,
but we are considering dropping this, since the allocation model
(explicit, static allocation of discrete resources) is not really the
same as for RDT MBA (dynamic prioritisation based on recent resource
consumption).
Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend
on an equal footing for memory bandwidth until one exceeds 50% (when it
will start to be penalised). Prorgamming bitmaps can't have the same
effect. For example, with { 1100, 0110, 0011, 1001 }, no group can use
more than 50% of the full bandwidth, whatever happens. Worse, certain
pairs of groups are fully isolated from each other, while others are
always in contention, not matter how little actual traffic is generated.
This is potentially useful, but it's not the same as the MIN/MAX model.
So, it may make more sense to expose this as a separate, bitmap schema.
(The same goes for "Proportional stride" partitioning. It's another,
different, control for memory bandwidth. As of today, I don't think
that we have a reference platform for experimenting with either of
these.)
> > When schemata is read, parents should always be printed before their
> > child schemata. But really, we just need to make sure that the
> > rdt_schema_all list is correctly ordered.
> >
> >
> > Do you think that this relationship needs to be reported to userspace?
>
> You brought up the topic of relationships in
> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
> to learn more from the MPAM spec where I learned and went on a tangent about all
> the other possible namespaces without circling back.
>
> I was hoping that the namespace prefix would make the relationships clear,
> something like <resource>_<control>, but I did not expect another layer in
> the hierarchy like your example above. The idea of "parent" and "child" is
> also not obvious to me at this point. resctrl gives us a "resource" to start
> with and we are now discussing multiple controls per resource. Could you please
> elaborate what you see as "parent" and "child"?
See above -- the parent/child concept is not an MPAM thing; apologies
if I didn't make that clear.
> We do have the info directory available to express relationships and a
> hierarchy is already starting to take shape there.
I'm wondering whether using a common prefix will be future-proof? It
may not always be clear which part of a name counts as the common
prefix.
There were already discussions about appending a number to a schema
name in order to control different memory regions -- that's another
prefix/suffix relationship, if so...
We could handle all of this by documenting all the relationships
explicitly. But I'm thinking that it could be easier for maintenance
if the resctrl core code has explicit knowledge of the relationships.
That said, using a common prefix is still a good idea. But maybe we
shouldn't lean on it too heavily as a way of actually describing the
relationships?
> > Since the "#" convention is for backward compatibility, maybe we should
> > not use this for new schemata, and place the burden of managing
> > conflicts onto userspace going forward. What do you think?
>
> I agree. The way I understand this is that the '#' will only be used for
> new controls that shadow the default/current controls of the legacy resources.
> I do not expect that the prefix will be needed for new resources, even if
> the initial support of a new resource does not include all possible controls.
OK. Note, relating this to the above, the # could be interpreted as
meaning "this is a child of some other schema; don't mess with it
unless you know what you are doing".
Older software doesn't understand the relationships, so this is just
there to stop it from shooting itself in the foot.
[...]
> >>>> MPAM has the "HARDLIM" distinction associated with these MAX values
> >>>> and from what I can tell this is per PARTID. Is this something that needs
> >>>> to be supported? To do this resctrl will need to support modifying
> >>>> control properties per resource group.
> >>>
> >>> Possibly. Since this is a boolean control that determines how the
> >>> MBW_MAX control is applied, we could perhaps present it as an
> >>> additional schema -- if so, it's basically orthogonal.
> >>>
> >>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
> >>>
> >>> or
> >>>
> >>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
> >>>
> >>> Does this look reasonable?
> >>
> >> It does.
> >
> > OK -- note, I don't think we have any immediate plan to support this in
> > the MPAM driver, but it may land eventually in some form.
> >
>
> ack.
(Or, of course, anything else that achieves the same goal...)
[...]
> >>> MPAM doesn't allow different controls for a PARTID depending on the
> >>> exception level, but it is possible to program different PARTIDs for
> >>> hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0).
[...]
> >>> Hopefully this is orthogonal to the discussion of schema descriptions,
> >>> though ...?
> >>
> >> Yes.
> >
> > OK; I suggest that we put this on one side, for now, then.
> >
> > There is a discussion to be had on this, but it feels like a separate
> > thing.
>
> agreed.
>
> >
> >
> > I'll try to pull the state of this discussion together -- maybe as a
> > draft update to the documentation, describing the interface as proposed
> > so far. Does that work for you?
>
> It does. Thank you very much for taking this on.
>
> Reinette
OK, I'll aim to follow up on this next week.
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-16 15:57 ` Reinette Chatre
@ 2025-10-17 15:52 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-10-17 15:52 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi there,
On Thu, Oct 16, 2025 at 08:57:59AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 10/15/25 8:18 AM, Dave Martin wrote:
[...]
> > I finally got around to testing this on top of -rc1.
> >
> > Disregarding trivial differences, the patched version (+++) doesn't
> > seem to introduce any regressions over the vanilla version (---)
> > (below). (The CMT test actually failed with an out-of-tolerance result
> > on the vanilla kernel only. Possibly there was some adverse system
> > load interfering.)
>
> My first thought is that this is another unfortunate consequence of the resctrl
> performance-as-functional tests.
> The percentage difference you encountered is quite large and that
> prompted me to take a closer look and it does look to me as though the CMT
> can be improved. (Whether we should spend more effort on these performance tests
> instead of creating new deterministic functional tests is another topic.)
I ran the tests soon after booting a full-fat OS, which may not have helped.
In an ideal world we sort of want two RDT instances, one under test,
and one outside it to isolate it from the rest of the system... but
that would require extra hardware :(
I'll aim to just boot to a shell the next time I run these tests.
> > Looking at the code, it seems that L2_NONCONT_CAT is not gated by any
> > config or mount option. I think this is just a feature that my
> > hardware doesn't support (?)
>
> Yes, this is how I also interpret the test output.
>
> Focusing on the CMT test ...
>
> > # Starting CMT test ...
> > # Mounting resctrl to "/sys/fs/resctrl"
> > # Cache size :23068672
> > # Writing benchmark parameters to resctrl FS
> > -# Benchmark PID: 5135
> > +# Benchmark PID: 4970
> > # Checking for pass/fail
> > -# Fail: Check cache miss rate within 15%
> > -# Percent diff=24
> > +# Pass: Check cache miss rate within 15%
> > +# Percent diff=4
> > # Number of bits: 5
> > -# Average LLC val: 7942963
> > +# Average LLC val: 10918297
> > # Cache span (bytes): 10485760
> > -not ok 3 CMT: test
> > +ok 3 CMT: test
>
> A 24% difference followed by a 4% difference is a big swing. On a high level
> the CMT test creates a new resource group with only the test assigned to it. The test
> initializes and accesses a buffer a couple of time while measuring cache occupancy.
> "success" is when the cache occupancy is within 15% of the buffer size.
>
> I noticed a couple of places where the test is susceptible to interference and
> system architecture.
> 1) The cache allocation of test's resource group overlaps with the rest of the
> system. On a busy system it is thus likely that the test's cache entries may be
> evicted.
> 2) The test does not account for cache architecture where, for example, there may be
> an L2 cache that can accommodate a large part of the buffer and thus not be
> reflected in the LLC occupancy count.
I suppose the test could take all the cache sizes into account, but
this could get complicated -- for a statistical test, it may not be worth it.
Can we probe it just by setting the CAT mask to all ones (maybe
excluding one bit to give to the default control group) and then
flooding with data until the LLC usage stops increasing?
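Something like the following toy script is what I have in mind -- the
flood/measure callbacks here are placeholders standing in for "touch
more data" and "read llc_occupancy", not the selftest's actual helpers:

```python
# Sketch of the probing idea: keep flooding the cache with data until
# the reported llc_occupancy stops increasing, and take the plateau as
# the effective LLC capacity. The callbacks are placeholders, not the
# resctrl selftest's real API.

def probe_llc_capacity(flood, read_occupancy, step=1 << 20,
                       tolerance=0.01, max_steps=1024):
    """Flood in `step`-byte increments until occupancy growth < tolerance."""
    prev = read_occupancy()
    for _ in range(max_steps):
        flood(step)
        cur = read_occupancy()
        if prev and (cur - prev) < tolerance * prev:
            return cur        # occupancy has plateaued
        prev = cur
    return prev

# Simulated hardware for demonstration: occupancy tracks the flooded
# footprint, capped at a 16 MiB LLC.
CAPACITY = 16 << 20
footprint = 0

def fake_flood(nbytes):
    global footprint
    footprint += nbytes

def fake_occupancy():
    return min(footprint, CAPACITY)

est = probe_llc_capacity(fake_flood, fake_occupancy)
assert abs(est - CAPACITY) / CAPACITY < 0.1
```

Whether the plateau is stable enough on real hardware (prefetchers,
other traffic) is exactly the statistical question, of course.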
> I started experimenting to see what it will take to reduce interference and ended up
> with a change like below that isolates the cache portions between the test and the
> rest of the system and if L2 cache allocation is possible, reduces the amount of L2
> cache the test can allocate into as much as possible. This opened up another tangent
> where the size of cache portion is the same as the buffer while it is not realistic
> to expect a user space buffer to fill into the cache so nicely.
>
> Even with these changes I was not able to get the percentages to drop significantly
> on my system but it may help to reduce the swings in numbers observed.
>
> But, I do not see how work like this helps to improve resctrl health (compared to,
> for example, just increasing the "success" percentage).
If I understand correctly, this programs the default control group with
the inverse of the test's CAT mask, so that there is no overlap? That
seems reasonable, if so.
I wonder whether bandwidth contention is also having an effect. Would
programming the default control and test control group with MB values
that don't add up to more than 100% help?
Cheers
---Dave
> diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
> index d09e693dc739..494e98aa8b69 100644
> --- a/tools/testing/selftests/resctrl/cmt_test.c
> +++ b/tools/testing/selftests/resctrl/cmt_test.c
> @@ -19,12 +19,22 @@
> #define CON_MON_LCC_OCCUP_PATH \
> "%s/%s/mon_data/mon_L3_%02d/llc_occupancy"
>
> -static int cmt_init(const struct resctrl_val_param *param, int domain_id)
> +static int cmt_init(const struct resctrl_test *test,
> + const struct user_params *uparams,
> + const struct resctrl_val_param *param, int domain_id)
> {
> + char schemata[64];
> + int ret;
> +
> sprintf(llc_occup_path, CON_MON_LCC_OCCUP_PATH, RESCTRL_PATH,
> param->ctrlgrp, domain_id);
>
> - return 0;
> + snprintf(schemata, sizeof(schemata), "%lx", param->mask);
> + ret = write_schemata(param->ctrlgrp, schemata, uparams->cpu, test->resource);
> + if (!ret && !strcmp(test->resource, "L3") && resctrl_resource_exists("L2"))
> + ret = write_schemata(param->ctrlgrp, "0x1", uparams->cpu, "L2");
> +
> + return ret;
> }
[...]
> @@ -162,6 +173,11 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
[...]
> + snprintf(schemata, sizeof(schemata), "%lx", ~param.mask & long_mask);
> + ret = write_schemata("", schemata, uparams->cpu, test->resource);
> + if (ret)
> + return ret;
> +
> remove(RESULT_FILE_NAME);
>
> ret = resctrl_val(test, uparams, &param);
> diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
[...]
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-17 14:17 ` Dave Martin
@ 2025-10-17 15:59 ` Reinette Chatre
2025-10-20 15:50 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Reinette Chatre @ 2025-10-17 15:59 UTC (permalink / raw)
To: Dave Martin
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On 10/17/25 7:17 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/15/25 8:47 AM, Dave Martin wrote:
>>> Hi Reinette,
>>>
>>> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote:
>>>> Hi Dave,
>>>>
>>>> On 10/13/25 7:36 AM, Dave Martin wrote:
...
>>>>> So long as the entries affecting a single resource are ordered so that
>>>>> each entry is strictly more specific than the previous entries (as
>>>>> illustrated above), then reading schemata and stripping all the hashes
>>>>> would allow a previous configuration to be restored; to change just one
>>>>> entry, userspace can uncomment just that one, or write only that entry
>>>>> (which is what I think we should recommend for new software).
>>>>
>>>> This is a good rule of thumb.
>>>
>>> To avoid printing entries in the wrong order, do we want to track some
>>> parent/child relationship between schemata.
>>>
>>> In the above example,
>>>
>>> * MB is the parent of MB_HW;
>>>
>>> * MB_HW is the parent of MB_MIN and MB_MAX.
>>>
>>> (for MPAM, at least).
>>
>> Could you please elaborate this relationship? I envisioned the MB_HW to be
>> something similar to Intel RDT's "optimal" bandwidth setting ... something
>> that is expected to be somewhere between the "min" and the "max".
>>
>> But, now I think I'm a bit lost in MPAM since it is not clear to me what
>> MB_HW represents ... would this be the "memory bandwidth portion
>> partitioning"? Although, that uses a completely different format from
>> "min" and "max".
>
> I confess that I'm thinking with an MPAM mindset here.
>
> Some pseudocode might help to illustrate how these might interact:
>
> set_MB(partid, val) {
> set_MB_HW(partid, percent_to_hw_val(val));
> }
>
> set_MB_HW(partid, val) {
> set_MB_MAX(partid, val);
>
> /*
> * Hysteresis to avoid steady flows from ping-ponging
> * between low and high priority:
> */
> if (hardware_has_MB_MIN())
> set_MB_MIN(partid, val * 95%);
> }
>
> set_MB_MIN(partid, val) {
> mpam->MBW_MIN[partid] = val;
> }
>
> set_MB_MAX(partid, val) {
> mpam->MBW_MAX[partid] = val;
> }
>
> with
>
> get_MB(partid) {
> return hw_val_to_percent(get_MB_HW(partid));
> }
>
> get_MB_HW(partid) { return get_MB_MAX(partid); }
>
> get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; }
>
> get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
>
>
> The parent/child relationship I suggested is basically the call-graph
> of this pseudocode. These could all be exposed as resctrl schemata,
> but the children provide finer / more broken-down control than the
> parents. Reading a parent provides a merged or approximated view of
> the configuration of the child schemata.
>
> In particular,
>
> set_child(partid, get_child(partid));
> get_parent(partid);
>
> yields the same result as
>
> get_parent(partid);
>
> but will not be true in general, if the roles of parent and child are
> reversed.
>
> I think this still holds true if implementing an "MB_HW" schema for
> newer revisions of RDT. The pseudocode would be different, but there
> will still be a tree-like call graph (?)
Thank you very much for the example. I missed in earlier examples that
MB_HW was being controlled via MB_MAX and MB_MIN.
I do not expect such a dependence or tree-like call graph for RDT where
the closest equivalent (termed "optimal") is programmed independently from
min and max.
>
>
> Going back to MPAM:
>
> Re MPAM memory bandwidth portion partitioning (a.k.a., MBW_PART or
> MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory
> bandwidth is split into discrete, non-overlapping chunks, and each
> PARTID is configured with a bitmap saying which chunks it can use.
> This could be done by time-slicing, or controlling which memory
> controllers/ports a PARTID can issue requests to, or something like
> that.
>
> If the MBW_MAX control isn't implemented, then the current MPAM driver
> maps this bitmap control onto the resctrl "MB" schema in a simple way,
> but we are considering dropping this, since the allocation model
> (explicit, static allocation of discrete resources) is not really the
> same as for RDT MBA (dynamic prioritisation based on recent resource
> consumption).
>
> Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend
> on an equal footing for memory bandwidth until one exceeds 50% (when it
> will start to be penalised). Programming bitmaps can't have the same
> effect. For example, with { 1100, 0110, 0011, 1001 }, no group can use
> more than 50% of the full bandwidth, whatever happens. Worse, certain
> pairs of groups are fully isolated from each other, while others are
> always in contention, no matter how little actual traffic is generated.
> This is potentially useful, but it's not the same as the MIN/MAX model.
>
> So, it may make more sense to expose this as a separate, bitmap schema.
>
> (The same goes for "Proportional stride" partitioning. It's another,
> different, control for memory bandwidth. As of today, I don't think
> that we have a reference platform for experimenting with either of
> these.)
Thank you.
>
>
>>> When schemata is read, parents should always be printed before their
>>> child schemata. But really, we just need to make sure that the
>>> rdt_schema_all list is correctly ordered.
>>>
>>>
>>> Do you think that this relationship needs to be reported to userspace?
>>
>> You brought up the topic of relationships in
>> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me
>> to learn more from the MPAM spec where I learned and went on tangent about all
>> the other possible namespaces without circling back.
>>
>> I was hoping that the namespace prefix would make the relationships clear,
>> something like <resource>_<control>, but I did not expect another layer in
>> the hierarchy like your example above. The idea of "parent" and "child" is
>> also not obvious to me at this point. resctrl gives us a "resource" to start
>> with and we are now discussing multiple controls per resource. Could you please
>> elaborate what you see as "parent" and "child"?
>
> See above -- the parent/child concept is not an MPAM thing; apologies
> if I didn't make that clear.
>
>> We do have the info directory available to express relationships and a
>> hierarchy is already starting to take shape there.
>
> I'm wondering whether using a common prefix will be future-proof? It
> may not always be clear which part of a name counts as the common
> prefix.
Apologies for my cryptic response. I was actually musing that we already
discussed using the info directory to express relationships between
controls and resources and it does not seem a big leap to expand
this to express relationships between controls. Consider something
like below for MPAM:
info
└── MB
    └── resource_schemata
        └── MB
            └── MB_HW
                ├── MB_MAX
                └── MB_MIN
On RDT it may then look different:
info
└── MB
    └── resource_schemata
        └── MB
            ├── MB_HW
            ├── MB_MAX
            └── MB_MIN
Having the resource name as common prefix does seem consistent and makes
clear to user space which controls apply to a resource.
>
> There were already discussions about appending a number to a schema
> name in order to control different memory regions -- that's another
> prefix/suffix relationship, if so...
>
> We could handle all of this by documenting all the relationships
> explicitly. But I'm thinking that it could be easier for maintenance
> if the resctrl core code has explicit knowledge of the relationships.
Not just for resctrl itself, but also to make clear to user space which
controls impact others and which are independent.
> That said, using a common prefix is still a good idea. But maybe we
> shouldn't lean on it too heavily as a way of actually describing the
> relationships?
I do not think we can rely on the order in the schemata file though. For
example, I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth"
for RDT to also use the MB_HW name (or maybe MPAM and RDT can both use
MB_OPT?). In either case the schemata may print something like below on
both platforms (copied from your original example), where for MPAM it
implies a relationship but for RDT it does not:
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
>>> Since the "#" convention is for backward compatibility, maybe we should
>>> not use this for new schemata, and place the burden of managing
>>> conflicts onto userspace going forward. What do you think?
>>
>> I agree. The way I understand this is that the '#' will only be used for
>> new controls that shadow the default/current controls of the legacy resources.
>> I do not expect that the prefix will be needed for new resources, even if
>> the initial support of a new resource does not include all possible controls.
>
> OK. Note, relating this to the above, the # could be interpreted as
> meaning "this is a child of some other schema; don't mess with it
> unless you know what you are doing".
Could it be made more specific to be "this is a child of a legacy schema created
before this new format existed; don't mess with it unless you know what you are
doing"?
That is, any schema created after this new format is established does not need
the '#' prefix even if there is a parent/child relationship?
>
> Older software doesn't understand the relationships, so this is just
> there to stop it from shooting itself in the foot.
ack.
By extension I assume that software that understands a schema that is introduced
after the "relationship" format is established can be expected to understand the
format and thus these new schemata do not require the '#' prefix. Even if
a new schema is introduced with a single control it can be followed by a new child
control without a '#' prefix a couple of kernel releases later. By this point it
should hopefully be understood by user space that it should not write entries it does
not understand.
>
> [...]
>
>>>>>> MPAM has the "HARDLIM" distinction associated with these MAX values
>>>>>> and from what I can tell this is per PARTID. Is this something that needs
>>>>>> to be supported? To do this resctrl will need to support modifying
>>>>>> control properties per resource group.
>>>>>
>>>>> Possibly. Since this is a boolean control that determines how the
>>>>> MBW_MAX control is applied, we could perhaps present it as an
>>>>> additional schema -- if so, it's basically orthogonal.
>>>>>
>>>>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...]
>>>>>
>>>>> or
>>>>>
>>>>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...]
>>>>>
>>>>> Does this look reasonable?
>>>>
>>>> It does.
>>>
>>> OK -- note, I don't think we have any immediate plan to support this in
>>> the MPAM driver, but it may land eventually in some form.
>>>
>>
>> ack.
>
> (Or, of course, anything else that achieves the same goal...)
Right ... I did not dig into syntax that could be made to match existing
schema formats etc. that can be filled in later.
...
>>> I'll try to pull the state of this discussion together -- maybe as a
>>> draft update to the documentation, describing the interface as proposed
>>> so far. Does that work for you?
>>
>> It does. Thank you very much for taking this on.
>>
>> Reinette
>
> OK, I'll aim to follow up on this next week.
Thank you very much.
Reinette
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-17 15:59 ` Reinette Chatre
@ 2025-10-20 15:50 ` Dave Martin
2025-10-20 16:31 ` Luck, Tony
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-10-20 15:50 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Reinette,
On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote:
> Hi Dave,
>
> On 10/17/25 7:17 AM, Dave Martin wrote:
> > Hi Reinette,
> >
> > On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> On 10/15/25 8:47 AM, Dave Martin wrote:
[...]
> >>> To avoid printing entries in the wrong order, do we want to track some
> >>> parent/child relationship between schemata.
> >>>
> >>> In the above example,
> >>>
> >>> * MB is the parent of MB_HW;
> >>>
> >>> * MB_HW is the parent of MB_MIN and MB_MAX.
> >>>
> >>> (for MPAM, at least).
> >>
> >> Could you please elaborate this relationship? I envisioned the MB_HW to be
> >> something similar to Intel RDT's "optimal" bandwidth setting ... something
> >> that is expected to be somewhere between the "min" and the "max".
> >>
> >> But, now I think I'm a bit lost in MPAM since it is not clear to me what
> >> MB_HW represents ... would this be the "memory bandwidth portion
> >> partitioning"? Although, that uses a completely different format from
> >> "min" and "max".
> >
> > I confess that I'm thinking with an MPAM mindset here.
> >
> > Some pseudocode might help to illustrate how these might interact:
> >
> > set_MB(partid, val) {
> > set_MB_HW(partid, percent_to_hw_val(val));
[...]
> > get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
> >
> >
> > The parent/child relationship I suggested is basically the call-graph
> > of this pseudocode. These could all be exposed as resctrl schemata,
> > but the children provide finer / more broken-down control than the
> > parents. Reading a parent provides a merged or approximated view of
> > the configuration of the child schemata.
> >
> > In particular,
> >
> > set_child(partid, get_child(partid));
> > get_parent(partid);
> >
> > yields the same result as
> >
> > get_parent(partid);
> >
> > but will not be true in general, if the roles of parent and child are
> > reversed.
> >
> > I think this still holds true if implementing an "MB_HW" schema for
> > newer revisions of RDT. The pseudocode would be different, but there
> > will still be a tree-like call graph (?)
>
> Thank you very much for the example. I missed in earlier examples that
> MB_HW was being controlled via MB_MAX and MB_MIN.
> I do not expect such a dependence or tree-like call graph for RDT where
> the closest equivalent (termed "optimal") is programmed independently from
> min and max.
I hadn't realised that this RDT feature has three control thresholds.
I'll comment in more detail on your sample info/ hierarchy, below.
> >
> > Going back to MPAM:
[...]
> > So, it may make more sense to expose [MBWPBM] as a separate, bitmap schema.
> >
> > (The same goes for "Proportional stride" partitioning. It's another,
> > different, control for memory bandwidth. As of today, I don't think
> > that we have a reference platform for experimenting with either of
> > these.)
>
> Thank you.
>
> >
> >
> >>> When schemata is read, parents should always be printed before their
> >>> child schemata. But really, we just need to make sure that the
> >>> rdt_schema_all list is correctly ordered.
> >>>
> >>>
> >>> Do you think that this relationship needs to be reported to userspace?
[...]
> >> We do have the info directory available to express relationships and a
> >> hierarchy is already starting to taking shape there.
> >
> > I'm wondering whether using a common prefix will be future-proof? It
> > may not always be clear which part of a name counts as the common
> > prefix.
>
> Apologies for my cryptic response. I was actually musing that we already
> discussed using the info directory to express relationships between
> controls and resources and it does not seem a big leap to expand
> this to express relationships between controls. Consider something
> like below for MPAM:
>
> info
> └── MB
>     └── resource_schemata
>         └── MB
>             └── MB_HW
>                 ├── MB_MAX
>                 └── MB_MIN
>
>
> On RDT it may then look different:
>
> info
> └── MB
>     └── resource_schemata
>         └── MB
>             ├── MB_HW
>             ├── MB_MAX
>             └── MB_MIN
>
> Having the resource name as common prefix does seem consistent and makes
> clear to user space which controls apply to a resource.
Ack.
The above hierarchies make sense, but I wonder whether we should be
forcing software to understand the MIN and MAX limits?
I can still see a benefit in having MB_HW be a generic, software-
defined control, even on RDT. Then, this can always be available,
with similar behaviour, on all resctrl instances that support memory
bandwidth controls. The precise set of child controls will vary per
arch (and on MPAM at least, between different hardware
implementations) -- so these look like they will work less well as a
generic interface.
Considering RDT: to avoid random regulation behaviour, RDT says that
you need MIN <= OPT <= MAX, so a generic "MB_HW" control that does not
require software to understand the individual MIN, OPT and MAX
thresholds would still need to program all of these under the hood so
as to avoid an invalid combination being set in the hardware.
If I have understood the definition of the MARC table correctly, then
there is a separate flag to report the presence of each of MIN, MAX and
OPT, so software _might_ be expected to use a random subset of them(?)
(If so, that's somewhat like the MPAM situation.)
So, I wonder whether we could actually have the following on RDT?
info
└── MB
    └── resource_schemata
        └── MB
            └── MB_HW
                ├── MB_MAX
                ├── MB_MIN
                └── MB_OPT
If MB_HW is programmed by software, then MB_MAX, MB_OPT and MB_MIN
would be programmed with some reasonable default spread (or possibly,
all with the same value).
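A sketch of that fan-out (the function name, the headroom choice, and the
95% hysteresis figure are assumptions for illustration, not a proposed
kernel interface):

```c
#include <assert.h>

/* Hypothetical percentage values for the three RDT thresholds. */
static unsigned int mb_min, mb_opt, mb_max;

/*
 * Program all three thresholds from a single MB_HW value with a
 * default spread, so that the hardware never sees an invalid
 * combination (RDT requires MIN <= OPT <= MAX).
 */
static void set_mb_hw(unsigned int val)
{
	mb_opt = val;
	mb_max = val;			/* or val plus some headroom */
	mb_min = val - val / 20;	/* ~95% of val, for hysteresis */

	assert(mb_min <= mb_opt && mb_opt <= mb_max);
}
```

Software that wants finer control would write MB_MIN, MB_OPT and MB_MAX
individually; a helper like this only covers the single-knob case.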
That way, software that wants independent control over MIN, OPT and MAX
can have it (and sweat the problem of dealing with hardware where they
aren't all implemented -- if that's a thing). But software that
doesn't need this fine control gets a single MB_HW knob that is more-or-
less portable between platforms.
Does that make sense, or is it an abstraction too far?
(Going one step further, maybe we can actually put MPAM and RDT
together with a 3-threshold model. For MPAM, we could possibly express
the HARDLIM option using the extra threshold... that probably needs a
bit more thought, though.)
> > There were already discussions about appending a number to a schema
> > name in order to control different memory regions -- that's another
> > prefix/suffix relationship, if so...
> >
> > We could handle all of this by documenting all the relationships
> > explicitly. But I'm thinking that it could be easier for maintenance
> > if the resctrl core code has explicit knowledge of the relationships.
>
> Not just for resctrl itself but to make clear to user space which
> controls impact others and which are independent.
> > That said, using a common prefix is still a good idea. But maybe we
> > shouldn't lean on it too heavily as a way of actually describing the
> > relationships?
> I do not think we can rely on the order in the schemata file though. For example,
> I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to
> also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?) in either
> case the schemata may print something like below on both platforms (copied from
> your original example) where for MPAM it implies a relationship but for RDT it
> does not:
>
> MB: 0=50, 1=50
> # MB_HW: 0=32, 1=32
> # MB_MIN: 0=31, 1=31
> # MB_MAX: 0=32, 1=32
This still DTRT though? If MB_HW maps into the "optimal bandwidth"
control on RDT, then it is still safe to program it first, before
MB_{MIN,MAX}.
The contents of the schemata file won't be sufficient to figure out the
relationships, but that wasn't my intention. We have info/ for that.
Instead, the schemata file just needs to be ordered in a way that is
compatible with those relationships, so that one line does not
unintentionally clobber the effect of a subsequent line.
My concern was that if we rely totally on manual maintenance to keep the
schemata file in a compatible order, we'll probably get that wrong
sooner or later...
> >>> Since the "#" convention is for backward compatibility, maybe we should
> >>> not use this for new schemata, and place the burden of managing
> >>> conflicts onto userspace going forward. What do you think?
> >>
> >> I agree. The way I understand this is that the '#' will only be used for
> >> new controls that shadow the default/current controls of the legacy resources.
> >> I do not expect that the prefix will be needed for new resources, even if
> >> the initial support of a new resource does not include all possible controls.
> >
> > OK. Note, relating this to the above, the # could be interpreted as
> > meaning "this is a child of some other schema; don't mess with it
> > unless you know what you are doing".
>
> Could it be made more specific to be "this is a child of a legacy schema created
> before this new format existed; don't mess with it unless you know what you are
> doing"?
> That is, any schema created after this new format is established does not need
> the '#' prefix even if there is a parent/child relationship?
Yes, I think so.
Except: if some schema is advertised and documented with no children,
then is it reasonable for software to assume that it will never have
children?
I think that the answer is probably "yes", in which case would it make
sense to # any schema that is a child of some schema that did not have
children in some previous upstream kernel?
> >
> > Older software doesn't understand the relationships, so this is just
> > there to stop it from shooting itself in the foot.
>
> ack.
>
> By extension I assume that software that understands a schema that is introduced
> after the "relationship" format is established can be expected to understand the
> format and thus these new schemata do not require the '#' prefix. Even if
> a new schema is introduced with a single control it can be followed by a new child
> control without a '#' prefix a couple of kernel releases later. By this point it
> should hopefully be understood by user space that it should not write entries it does
> not understand.
Generally, yes.
I think that boils down to: "OK, previously you could just tweak bits
of the whole schemata file you read and write the whole thing back,
and the effect would be what you intuitively expected. But in future
different schemata in the file may not be independent of one another.
We'll warn you which things might not be independent, but we may not
describe exactly how they affect each other.
"So, from now on, only write the things that you actually want to set."
Does that sound about right?
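As a concrete sketch of that recommendation, userspace would write just the
one entry it wants to change rather than the whole file (the helper name
and path are illustrative):

```c
#include <stdio.h>

/*
 * Write a single schemata entry, e.g. "MB:0=50", instead of
 * read-modify-writing the whole schemata file. Returns 0 on success,
 * -1 on failure.
 */
static int write_one_entry(const char *schemata_path,
			   const char *schema, const char *value)
{
	FILE *f = fopen(schemata_path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s:%s\n", schema, value);
	return fclose(f) ? -1 : 0;
}
```

For example, write_one_entry("/sys/fs/resctrl/grp1/schemata", "MB", "0=50")
would update only the MB entry of that group, leaving every other schema
untouched.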
[...]
> >>>
> >>> OK -- note, I don't think we have any immediate plan to support [HARDLIM] in
> >>> the MPAM driver, but it may land eventually in some form.
> >>>
> >>
> >> ack.
> >
> > (Or, of course, anything else that achieves the same goal...)
>
> Right ... I did not dig into syntax that could be made to match existing
> schema formats etc. that can be filled in later.
Ack
> ...
>
> >>> I'll try to pull the state of this discussion together -- maybe as a
> >>> draft update to the documentation, describing the interface as proposed
> >>> so far. Does that work for you?
> >>
> >> It does. Thank you very much for taking this on.
> >>
> >> Reinette
> >
> > OK, I'll aim to follow up on this next week.
>
> Thank you very much.
>
> Reinette
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-20 15:50 ` Dave Martin
@ 2025-10-20 16:31 ` Luck, Tony
2025-10-21 14:37 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2025-10-20 16:31 UTC (permalink / raw)
To: Dave Martin
Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote:
> Hi Reinette,
>
> On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote:
> > Hi Dave,
> >
> > On 10/17/25 7:17 AM, Dave Martin wrote:
> > > Hi Reinette,
> > >
> > > On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote:
> > >> Hi Dave,
> > >>
> > >> On 10/15/25 8:47 AM, Dave Martin wrote:
>
> [...]
>
> > >>> To avoid printing entries in the wrong order, do we want to track some
> > >>> parent/child relationship between schemata.
> > >>>
> > >>> In the above example,
> > >>>
> > >>> * MB is the parent of MB_HW;
> > >>>
> > >>> * MB_HW is the parent of MB_MIN and MB_MAX.
> > >>>
> > >>> (for MPAM, at least).
> > >>
> > >> Could you please elaborate this relationship? I envisioned the MB_HW to be
> > >> something similar to Intel RDT's "optimal" bandwidth setting ... something
> > >> that is expected to be somewhere between the "min" and the "max".
> > >>
> > >> But, now I think I'm a bit lost in MPAM since it is not clear to me what
> > >> MB_HW represents ... would this be the "memory bandwidth portion
> > >> partitioning"? Although, that uses a completely different format from
> > >> "min" and "max".
> > >
> > > I confess that I'm thinking with an MPAM mindset here.
> > >
> > > Some pseudocode might help to illustrate how these might interact:
> > >
> > > set_MB(partid, val) {
> > > set_MB_HW(partid, percent_to_hw_val(val));
>
> [...]
>
> > > get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }
> > >
> > >
> > > The parent/child relationship I suggested is basically the call-graph
> > > of this pseudocode. These could all be exposed as resctrl schemata,
> > > but the children provide finer / more broken-down control than the
> > > parents. Reading a parent provides a merged or approximated view of
> > > the configuration of the child schemata.
> > >
> > > In particular,
> > >
> > > set_child(partid, get_child(partid));
> > > get_parent(partid);
> > >
> > > yields the same result as
> > >
> > > get_parent(partid);
> > >
> > > but will not be true in general, if the roles of parent and child are
> > > reversed.
> > >
> > > I think this still holds true if implementing an "MB_HW" schema for
> > > newer revisions of RDT. The pseudocode would be different, but there
> > > will still be a tree-like call graph (?)
> >
> > Thank you very much for the example. I missed in earlier examples that
> > MB_HW was being controlled via MB_MAX and MB_MIN.
> > I do not expect such a dependence or tree-like call graph for RDT where
> > the closest equivalent (termed "optimal") is programmed independently from
> > min and max.
>
> I hadn't realised that this RDT feature has three control thresholds.
>
> I'll comment in more detail on your sample info/ hierarchy, below.
>
> > >
> > > Going back to MPAM:
>
> [...]
>
> > > So, it may make more sense to expose [MBWPBM] as a separate, bitmap schema.
> > >
> > > (The same goes for "Proportional stride" partitioning. It's another,
> > > different, control for memory bandwidth. As of today, I don't think
> > > that we have a reference platform for experimenting with either of
> > > these.)
> >
> > Thank you.
> >
> > >
> > >
> > >>> When schemata is read, parents should always be printed before their
> > >>> child schemata. But really, we just need to make sure that the
> > >>> rdt_schema_all list is correctly ordered.
> > >>>
> > >>>
> > >>> Do you think that this relationship needs to be reported to userspace?
>
> [...]
>
> > >> We do have the info directory available to express relationships and a
> > >> hierarchy is already starting to taking shape there.
> > >
> > > I'm wondering whether using a common prefix will be future-proof? It
> > > may not always be clear which part of a name counts as the common
> > > prefix.
> >
> > Apologies for my cryptic response. I was actually musing that we already
> > discussed using the info directory to express relationships between
> > controls and resources and it does not seem a big leap to expand
> > this to express relationships between controls. Consider something
> > like below for MPAM:
> >
> > info
> > └── MB
> >     └── resource_schemata
> >         └── MB
> >             └── MB_HW
> >                 ├── MB_MAX
> >                 └── MB_MIN
> >
> >
> > On RDT it may then look different:
> >
> > info
> > └── MB
> >     └── resource_schemata
> >         └── MB
> >             ├── MB_HW
> >             ├── MB_MAX
> >             └── MB_MIN
> >
> > Having the resource name as common prefix does seem consistent and makes
> > clear to user space which controls apply to a resource.
>
> Ack.
>
> The above hierarchies make sense, but I wonder whether we should be
> forcing software to understand the MIN and MAX limits?
>
> I can still see a benefit in having MB_HW be a generic, software-
> defined control, even on RDT. Then, this can always be available,
> with similar behaviour, on all resctrl instances that support memory
> bandwidth controls. The precise set of child controls will vary per
> arch (and on MPAM at least, between different hardware
> implementations) -- so these look like they will work less well as a
> generic interface.
>
>
> Considering RDT: to avoid random regulation behaviour, RDT says that
> you need MIN <= OPT <= MAX, so a generic "MB_HW" control that does not
> require software to understand the individual MIN, OPT and MAX
> thresholds would still need to program all of these under the hood so
> as to avoid an invalid combination being set in the hardware.
>
> If I have understood the definition of the MARC table correctly, then
> there is a separate flag to report the presence of each of MIN, MAX and
> OPT, so software _might_ be expected to use a random subset of them(?)
> (If so, that's somewhat like the MPAM situation.)
>
> So, I wonder whether we could actually have the following on RDT?
>
> info
> └── MB
>     └── resource_schemata
>         └── MB
>             └── MB_HW
>                 ├── MB_MAX
>                 ├── MB_MIN
>                 └── MB_OPT
>
> If MB_HW is programmed by software, then MB_MAX, MB_OPT and MB_MIN
> would be programmed with some reasonable default spread (or possibly,
> all with the same value).
>
> That way, software that wants independent control over MIN, OPT and MAX
> can have it (and sweat the problem of dealing with hardware where they
> aren't all implemented -- if that's a thing). But software that
> doesn't need this fine control gets a single MB_HW knob that is more-or-
> less portable between platforms.
>
> Does that make sense, or is it an abstraction too far?
>
>
> (Going one step further, maybe we can actually put MPAM and RDT
> together with a 3-threshold model. For MPAM, we could possibly express
> the HARDLIM option using the extra threshold... that probably needs a
> bit more thought, though.)
>
> > > There were already discussions about appending a number to a schema
> > > name in order to control different memory regions -- that's another
> > > prefix/suffix relationship, if so...
> > >
> > > We could handle all of this by documenting all the relationships
> > > explicitly. But I'm thinking that it could be easier for maintenance
> > > if the resctrl core code has explicit knowledge of the relationships.
> >
> > Not just for resctrl itself but to make clear to user space which
> > controls impact others and which are independent.
> > > That said, using a common prefix is still a good idea. But maybe we
> > > shouldn't lean on it too heavily as a way of actually describing the
> > > relationships?
> > I do not think we can rely on the order in the schemata file though. For example,
> > I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to
> > also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?) in either
> > case the schemata may print something like below on both platforms (copied from
> > your original example) where for MPAM it implies a relationship but for RDT it
> > does not:
> >
> > MB: 0=50, 1=50
> > # MB_HW: 0=32, 1=32
> > # MB_MIN: 0=31, 1=31
> > # MB_MAX: 0=32, 1=32
>
> This still DTRT though? If MB_HW maps into the "optimal bandwidth"
> control on RDT, then it is still safe to program it first, before
> MB_{MIN,MAX}.
>
> The contents of the schemata file won't be sufficient to figure out the
> relationships, but that wasn't my intention. We have info/ for that.
>
> Instead, the schemata file just needs to be ordered in a way that is
> compatible with those relationships, so that one line does not
> unintentionally clobber the effect of a subsequent line.
>
>
> My concern was that if we rely totally on manual maintenance to keep the
> schemata file in a compatible order, we'll probably get that wrong
> sooner or later...
>
> > >>> Since the "#" convention is for backward compatibility, maybe we should
> > >>> not use this for new schemata, and place the burden of managing
> > >>> conflicts onto userspace going forward. What do you think?
> > >>
> > >> I agree. The way I understand this is that the '#' will only be used for
> > >> new controls that shadow the default/current controls of the legacy resources.
> > >> I do not expect that the prefix will be needed for new resources, even if
> > >> the initial support of a new resource does not include all possible controls.
> > >
> > > OK. Note, relating this to the above, the # could be interpreted as
> > > meaning "this is a child of some other schema; don't mess with it
> > > unless you know what you are doing".
> >
> > Could it be made more specific to be "this is a child of a legacy schema created
> > before this new format existed; don't mess with it unless you know what you are
> > doing"?
> > That is, any schema created after this new format is established does not need
> > the '#' prefix even if there is a parent/child relationship?
>
> Yes, I think so.
>
> Except: if some schema is advertised and documented with no children,
> then is it reasonable for software to assume that it will never have
> children?
>
> I think that the answer is probably "yes", in which case would it make
> sense to # any schema that is a child of some schema that did not have
> children in some previous upstream kernel?
>
> > >
> > > Older software doesn't understand the relationships, so this is just
> > > there to stop it from shooting itself in the foot.
> >
> > ack.
> >
> > By extension I assume that software that understands a schema that is introduced
> > after the "relationship" format is established can be expected to understand the
> > format and thus these new schemata do not require the '#' prefix. Even if
> > a new schema is introduced with a single control it can be followed by a new child
> > control without a '#' prefix a couple of kernel releases later. By this point it
> > should hopefully be understood by user space that it should not write entries it does
> > not understand.
>
> Generally, yes.
>
> I think that boils down to: "OK, previously you could just tweak bits
> of the whole schemata file you read and write the whole thing back,
> and the effect would be what you intuitively expected. But in future
> different schemata in the file may not be independent of one another.
> We'll warn you which things might not be independent, but we may not
> describe exactly how they affect each other.
Changes to the schemata file are currently "staged" and then applied.
There's some filesystem level error/sanity checking during the parsing
phase, but maybe for MB some parts can also be delayed, and re-ordered
when architecture code applies the changes.
E.g. while filesystem code could check min <= opt <= max. Architecture
code would be responsible to write the values to h/w in a sane manner
(assuming architecture cares about transient effects when things don't
conform to the ordering).
E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
Regardless of the order those requests appeared in the write(2) syscall
architecture bumps max to 60, then opt to 50, and finally min to 40.
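[Editorial sketch of that re-ordering rule -- illustrative only; the
function and field names are made up, and real arch code would issue
one register write per assignment:]

```python
def apply_mb_update(hw, new):
    """Apply a new {min, opt, max} request to 'hw' (a stand-in for the
    hardware registers) so that min <= opt <= max holds after every
    individual write, whatever order the values arrived in.
    """
    # Raise thresholds top-down (max, then opt, then min) so that each
    # increase stays at or below the threshold above it ...
    for key in ("max", "opt", "min"):
        if new[key] > hw[key]:
            hw[key] = new[key]
    # ... then lower them bottom-up (min, then opt, then max) so that
    # each decrease stays at or above the threshold below it.
    for key in ("min", "opt", "max"):
        if new[key] < hw[key]:
            hw[key] = new[key]
```

For the 10,20,30 -> 40,50,60 example above this bumps max to 60, then
opt to 50, then min to 40, as described.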
>
> "So, from now on, only write the things that you actually want to set."
>
> Does that sound about right?
Users might still use their favorite editor on the schemata file and
so write everything, while only changing a subset. So if we don't go
for the full two-phase update I describe above this would be:
"only *change* the things that you actually want to set".
> [...]
>
> > >>>
> > >>> OK -- note, I don't think we have any immediate plan to support [HARDLIM] in
> > >>> the MPAM driver, but it may land eventually in some form.
> > >>>
> > >>
> > >> ack.
> > >
> > > (Or, of course, anything else that achieves the same goal...)
> >
> > Right ... I did not dig into syntax that could be made to match existing
> > schema formats etc. that can be filled in later.
>
> Ack
>
> > ...
> >
> > >>> I'll try to pull the state of this discussion together -- maybe as a
> > >>> draft update to the documentation, describing the interface as proposed
> > >>> so far. Does that work for you?
> > >>
> > >> It does. Thank you very much for taking this on.
> > >>
> > >> Reinette
> > >
> > > OK, I'll aim to follow up on this next week.
> >
> > Thank you very much.
> >
> > Reinette
>
> Cheers
> ---Dave
-Tony
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-20 16:31 ` Luck, Tony
@ 2025-10-21 14:37 ` Dave Martin
2025-10-21 20:59 ` Luck, Tony
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-10-21 14:37 UTC (permalink / raw)
To: Luck, Tony
Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Tony,
On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote:
> On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote:
> > Hi Reinette,
> >
> > On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote:
[...]
> > > By extension I assume that software that understands a schema that is introduced
> > > after the "relationship" format is established can be expected to understand the
> > > format and thus these new schemata do not require the '#' prefix. Even if
> > > a new schema is introduced with a single control it can be followed by a new child
> > > control without a '#' prefix a couple of kernel releases later. By this point it
> > > should hopefully be understood by user space that it should not write entries it does
> > > not understand.
> >
> > Generally, yes.
> >
> > I think that boils down to: "OK, previously you could just tweak bits
> > of the whole schemata file you read and write the whole thing back,
> > and the effect would be what you intuitively expected. But in future
> > different schemata in the file may not be independent of one another.
> > We'll warn you which things might not be independent, but we may not
> > describe exactly how they affect each other.
>
> Changes to the schemata file are currently "staged" and then applied.
> There's some filesystem level error/sanity checking during the parsing
> phase, but maybe for MB some parts can also be delayed, and re-ordered
> when architecture code applies the changes.
>
> E.g. while filesystem code could check min <= opt <= max. Architecture
> code would be responsible to write the values to h/w in a sane manner
> (assuming architecture cares about transient effects when things don't
> conform to the ordering).
>
> E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
> Regardless of the order those requests appeared in the write(2) syscall
> architecture bumps max to 60, then opt to 50, and finally min to 40.
This could indeed be sorted out during staging, but I'm not
sure that we can/should rely on it.
If we treat the data coming from a single write() as a transaction, and
stage the whole thing before executing it, that's fine. But I think
this has to be viewed as an optimisation rather than guaranteed
semantics.
We told userspace that schemata is an S_IFREG regular file, so we have
to accept a write() boundary anywhere in the stream.
(In fact, resctrl chokes if a write boundary occurs in the middle of a
line. In practice, stdio buffering and similar means that this issue
turns out to be difficult to hit, except with shell scripts that try to
emit a line piecemeal -- I have a partial fix for that knocking around,
but this throws up other problems, so I gave up for the time being.)
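[A sketch of the kind of partial fix alluded to here -- hypothetical,
not what resctrl does today: buffer data across write() calls and only
hand complete, newline-terminated lines to the parser:]

```python
class LineBuffer:
    """Accumulate partial write() payloads and release only complete
    lines, so a write boundary mid-line no longer causes a parse
    failure. (Illustrative; resctrl currently rejects such writes.)"""

    def __init__(self):
        self._pending = ""

    def write(self, data):
        """Buffer 'data'; return the complete lines it completed."""
        self._pending += data
        lines = self._pending.split("\n")
        self._pending = lines.pop()  # keep the unterminated tail
        return [l for l in lines if l]
```

One of the "other problems" is visible even in this sketch: if the
writer dies after an unterminated fragment, the tail sits in the buffer
with no obvious point at which to discard it.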
We also cannot currently rely on userspace closing the fd between
"transactions". We never told userspace to do that, previously. We
could make a new requirement, but it feels unexpected/unreasonable (?)
> >
> > "So, from now on, only write the things that you actually want to set."
> >
> > Does that sound about right?
>
> Users might still use their favorite editor on the schemata file and
> so write everything, while only changing a subset. So if we don't go
> for the full two-phase update I describe above this would be:
>
> "only *change* the things that you actually want to set".
[...]
> -Tony
This works if the schemata file is output in the right order (and the
user doesn't change the order):
# cat schemata
MB:0=100;1=100
# MB_HW:0=1024;1=1024
->
# cat <<EOF >schemata
MB:0=100;1=100
MB_HW:0=512;1=512
EOF
... though it may still be inefficient, if the lines are not staged
together. The hardware memory bandwidth controls may get programmed
twice, here -- though the final result is probably what was intended.
I'd still prefer that we tell people that they should be doing this:
# cat <<EOF >schemata
MB_HW:0=512;1=512
EOF
...if they are really trying to set MB_HW and don't care about the
effect on MB?
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-21 14:37 ` Dave Martin
@ 2025-10-21 20:59 ` Luck, Tony
2025-10-22 14:58 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2025-10-21 20:59 UTC (permalink / raw)
To: Dave Martin
Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote:
> Hi Tony,
>
> On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote:
> > On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote:
> > > Hi Reinette,
> > >
> > > On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote:
>
> [...]
>
> > > > By extension I assume that software that understands a schema that is introduced
> > > > after the "relationship" format is established can be expected to understand the
> > > > format and thus these new schemata do not require the '#' prefix. Even if
> > > > a new schema is introduced with a single control it can be followed by a new child
> > > > control without a '#' prefix a couple of kernel releases later. By this point it
> > > > should hopefully be understood by user space that it should not write entries it does
> > > > not understand.
> > >
> > > Generally, yes.
> > >
> > > I think that boils down to: "OK, previously you could just tweak bits
> > > of the whole schemata file you read and write the whole thing back,
> > > and the effect would be what you intuitively expected. But in future
> > > different schemata in the file may not be independent of one another.
> > > We'll warn you which things might not be independent, but we may not
> > > describe exactly how they affect each other.
> >
> > Changes to the schemata file are currently "staged" and then applied.
> > There's some filesystem level error/sanity checking during the parsing
> > phase, but maybe for MB some parts can also be delayed, and re-ordered
> > when architecture code applies the changes.
> >
> > E.g. while filesystem code could check min <= opt <= max. Architecture
> > code would be responsible to write the values to h/w in a sane manner
> > (assuming architecture cares about transient effects when things don't
> > conform to the ordering).
> >
> > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
> > Regardless of the order those requests appeared in the write(2) syscall
> > architecture bumps max to 60, then opt to 50, and finally min to 40.
>
> This could indeed be sorted out during staging, but I'm not
> sure that we can/should rely on it.
>
> If we treat the data coming from a single write() as a transaction, and
> stage the whole thing before executing it, that's fine. But I think
> this has to be viewed as an optimisation rather than guaranteed
> semantics.
>
>
> We told userspace that schemata is an S_IFREG regular file, so we have
> to accept a write() boundary anywhere in the stream.
>
> (In fact, resctrl chokes if a write boundary occurs in the middle of a
> line. In practice, stdio buffering and similar means that this issue
> turns out to be difficult to hit, except with shell scripts that try to
> emit a line piecemeal -- I have a partial fix for that knocking around,
> but this throws up other problems, so I gave up for the time being.)
Is this worth the pain and complexity? Maybe just document the reality
of the implementation since day 1 of resctrl that each write(2) must
contain one or more lines, each terminated with "\n".
There are already so many ways that the schemata file does not behave
like a regular S_IFREG file. E.g. accepting a write to just update
one domain in a resource: # echo L3:2=0xff > schemata
So describe schemata in terms of writing "update commands" rather
than "Lines"?
>
> We also cannot currently rely on userspace closing the fd between
> "transactions". We never told userspace to do that, previously. We
> could make a new requirement, but it feels unexpected/unreasonable (?)
>
> > >
> > > "So, from now on, only write the things that you actually want to set."
> > >
> > > Does that sound about right?
> >
> > Users might still use their favorite editor on the schemata file and
> > so write everything, while only changing a subset. So if we don't go
> > for the full two-phase update I describe above this would be:
> >
> > "only *change* the things that you actually want to set".
I misremembered where the check for "did the user change the value"
happened. I thought it was during parsing, but it is actually in
resctrl_arch_update_domains() after all input parsing is complete
and resctrl is applying changes. So unless we change things to work
the way I hallucinated, then ordering does matter the way you
described.
>
> [...]
>
> > -Tony
>
> This works if the schemata file is output in the right order (and the
> user doesn't change the order):
>
> # cat schemata
> MB:0=100;1=100
> # MB_HW:0=1024;1=1024
>
> ->
>
> # cat <<EOF >schemata
> MB:0=100;1=100
> MB_HW:0=512;1=512
> EOF
>
> ... though it may still be inefficient, if the lines are not staged
> together. The hardware memory bandwidth controls may get programmed
> twice, here -- though the final result is probably what was intended.
>
> I'd still prefer that we tell people that they should be doing this:
> # cat <<EOF >schemata
> MB_HW:0=512;1=512
> EOF
>
> > ...if they are really trying to set MB_HW and don't care about the
> effect on MB?
I'm starting to worry about this co-existence of old/new syntax for
Intel region aware. Life seems simple if there is only one MB_HW
connected to the legacy "MB". Updates to either will make both
appear with new values when the schemata is read. E.g.
# cat schemata
MB:0=100
#MB_HW=255
# echo MB:0=50 > schemata
# cat schemata
MB:0=50
#MB_HW=127
But Intel will have several MB_HW controls, one for each region.
[Schemata names TBD, but I'll just call them 0, 1, 2, 3 here]
# cat schemata
MB:0=100
#MB_HW0=255
#MB_HW1=255
#MB_HW2=255
#MB_HW3=255
If the user sets just one of the HW controls:
# echo MB_HW1=64
what should resctrl display for the legacy "MB:" line?
>
> Cheers
> ---Dave
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-21 20:59 ` Luck, Tony
@ 2025-10-22 14:58 ` Dave Martin
2025-10-22 16:21 ` Luck, Tony
0 siblings, 1 reply; 52+ messages in thread
From: Dave Martin @ 2025-10-22 14:58 UTC (permalink / raw)
To: Luck, Tony
Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Tony,
On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote:
> Hi Dave,
>
> On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote:
> > Hi Tony,
> >
> > On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote:
[...]
> > > Changes to the schemata file are currently "staged" and then applied.
> > > There's some filesystem level error/sanity checking during the parsing
> > > phase, but maybe for MB some parts can also be delayed, and re-ordered
> > > when architecture code applies the changes.
> > >
> > > E.g. while filesystem code could check min <= opt <= max. Architecture
> > > code would be responsible to write the values to h/w in a sane manner
> > > (assuming architecture cares about transient effects when things don't
> > > conform to the ordering).
> > >
> > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
> > > Regardless of the order those requests appeared in the write(2) syscall
> > > architecture bumps max to 60, then opt to 50, and finally min to 40.
> >
> > This could indeed be sorted out during staging, but I'm not
> > sure that we can/should rely on it.
> >
> > If we treat the data coming from a single write() as a transaction, and
> > stage the whole thing before executing it, that's fine. But I think
> > this has to be viewed as an optimisation rather than guaranteed
> > semantics.
> >
> >
> > We told userspace that schemata is an S_IFREG regular file, so we have
> > to accept a write() boundary anywhere in the stream.
> >
> > (In fact, resctrl chokes if a write boundary occurs in the middle of a
> > line. In practice, stdio buffering and similar means that this issue
> > turns out to be difficult to hit, except with shell scripts that try to
> > emit a line piecemeal -- I have a partial fix for that knocking around,
> > but this throws up other problems, so I gave up for the time being.)
>
> Is this worth the pain and complexity? Maybe just document the reality
> of the implementation since day 1 of resctrl that each write(2) must
> contain one or more lines, each terminated with "\n".
<soapbox>
We could, in the same way that a vendor could wire a UART directly to
the pins of a regular mains power plug. They could stick a big label
on it saying exactly how the pins should be hooked up to another low-
voltage UART and not plugged into a mains power outlet... but you know
what's going to happen.
The whole point of a file-like interface is that the user doesn't (or
shouldn't) have to craft I/O directly at the syscall level. If they
have to do that, then the reasons for not relying on ioctl() or a
binary protocol melt away (like that UART).
Because the easy, unsafe way of working with these files almost always
works, people are almost certainly going to use it, even if we tell
them not to (IMHO).
</soapbox>
That said, for practical purposes, the interface is reliable enough
(for now). We probably shouldn't mess with it unless we can come up
with something that is clearly better.
(I have some ideas, but I think it's off-topic, here.)
> There are already so many ways that the schemata file does not behave
> like a regular S_IFREG file. E.g. accepting a write to just update
> one domain in a resource: # echo L3:2=0xff > schemata
That still feels basically file-like. I can write something into a
file, then something else can read what I wrote, interpret it in any
way it likes, and write back something different for me to read.
In our case, it is as if after each write() the kernel magically reads
and rewrites the file before userspace gets a chance to do anything
else. This doesn't work as a protocol between userspace processes, but
the kernel can pull tricks that are not available to userspace -- so it
can be made to work for user <-> kernel protocols (modulo the issues
about write() boundaries etc.)
> So describe schemata in terms of writing "update commands" rather
> than "Lines"?
That's reasonable. In practice, each line written is a request to the
kernel to do something, but it's already the case that the kernel
doesn't necessarily do exactly what was asked for (due to rounding,
etc.)
Overall, I think the current state of play is that we need to consider
the lines to be independent "commands", and execute them in the order
given.
That's the model I've been assuming here.
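[An illustrative model of that -- not the actual kernel parser; names
and types are invented here -- where each newline-terminated line
becomes one independent command, executed in the order given:]

```python
def parse_schemata_write(buf):
    """Split a schemata write into an ordered list of
    (resource, {domain: value}) commands, one per line."""
    cmds = []
    for line in buf.strip().splitlines():
        res, _, rest = line.partition(":")
        doms = {}
        for tok in rest.split(";"):
            dom, _, val = tok.partition("=")
            doms[int(dom)] = val
        cmds.append((res.strip(), doms))
    return cmds
```

A later command can thus silently override the effect of an earlier
one, which is exactly why the output ordering discussed above matters.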
> > We also cannot currently rely on userspace closing the fd between
> > "transactions". We never told userspace to do that, previously. We
> > could make a new requirement, but it feels unexpected/unreasonable (?)
> >
> > > >
> > > > "So, from now on, only write the things that you actually want to set."
> > > >
> > > > Does that sound about right?
> > >
> > > Users might still use their favorite editor on the schemata file and
> > > so write everything, while only changing a subset. So if we don't go
> > > for the full two-phase update I describe above this would be:
> > >
> > > "only *change* the things that you actually want to set".
>
> I misremembered where the check for "did the user change the value"
> happened. I thought it was during parsing, but it is actually in
> resctrl_arch_update_domains() after all input parsing is complete
> and resctrl is applying changes. So unless we change things to work
> the way I hallucinated, then ordering does matter the way you
> described.
Ah, right.
There would be different ways to do this, but yes, that was my
understanding of how things work today.
> >
> > [...]
> >
> > > -Tony
> >
> > This works if the schemata file is output in the right order (and the
> > user doesn't change the order):
> >
> > # cat schemata
> > MB:0=100;1=100
> > # MB_HW:0=1024;1=1024
> >
> > ->
> >
> > # cat <<EOF >schemata
> > MB:0=100;1=100
> > MB_HW:0=512;1=512
> > EOF
> >
> > ... though it may still be inefficient, if the lines are not staged
> > together. The hardware memory bandwidth controls may get programmed
> > twice, here -- though the final result is probably what was intended.
> >
> > I'd still prefer that we tell people that they should be doing this:
> > # cat <<EOF >schemata
> > MB_HW:0=512;1=512
> > EOF
> >
> > ...if they are really trying to set MB_HW and don't care about the
> > effect on MB?
>
> I'm starting to worry about this co-existence of old/new syntax for
> Intel region aware. Life seems simple if there is only one MB_HW
> connected to the legacy "MB". Updates to either will make both
> appear with new values when the schemata is read. E.g.
>
> # cat schemata
> MB:0=100
> #MB_HW=255
>
> # echo MB:0=50 > schemata
>
> # cat schemata
> MB:0=50
> #MB_HW=127
>
> But Intel will have several MB_HW controls, one for each region.
> [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here]
>
> # cat schemata
> MB:0=100
> #MB_HW0=255
> #MB_HW1=255
> #MB_HW2=255
> #MB_HW3=255
>
> If the user sets just one of the HW controls:
>
> # echo MB_HW1=64
>
> what should resctrl display for the legacy "MB:" line?
>
> -Tony
Erm, good question. I hadn't thought too carefully about the region-
aware case.
I think it's reasonable to expect software that writes MB_HW<n>
independently to pay attention only to these specific schemata when
reading back -- a bit like accessing a C union.
# echo 'MB:0=100' >schemata
# cat schemata
->
MB:0=100
# MB_HW:0=255
# MB_HW0:0=255
# MB_HW1:0=255
# MB_HW2:0=255
# MB_HW3:0=255
# echo 'MB:0=50' >schemata
# cat schemata
->
MB:0=50
# MB_HW:0=128
# MB_HW0:0=128
# MB_HW1:0=128
# MB_HW2:0=128
# MB_HW3:0=128
# echo 'MB_HW:0=127' >schemata
# cat schemata
->
MB:0=50
# MB_HW:0=127
# MB_HW0:0=127
# MB_HW1:0=127
# MB_HW2:0=127
# MB_HW3:0=127
# echo 'MB_HW1:0=64' >schemata
# cat schemata
->
MB:0=???
# MB_HW:0=???
# MB_HW0:0=127
# MB_HW1:0=64
# MB_HW2:0=127
# MB_HW3:0=127
The rules for populating the ??? entries could be designed to be
somewhat intuitive, or we could just do the easiest thing.
So, could we just pick one, fixed, region to read the MB_HW value from?
Say, MB_HW0:
MB:0=50
# MB_HW:0=127
# MB_HW0:0=127
# MB_HW1:0=64
# MB_HW2:0=127
# MB_HW3:0=127
Or take the average across all regions:
MB:0=44
# MB_HW:0=111
# MB_HW0:0=127
# MB_HW1:0=64
# MB_HW2:0=127
# MB_HW3:0=127
The latter may be more costly or complex to implement, and I don't
know whether it is really useful. Software that knows about the
MB_HW<n> entries also knows that once you have looked at these, MB_HW
and MB tell you nothing else.
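[For concreteness, the averaging option could be sketched like this --
purely illustrative; the function name, the rounding, and the 0..255
hardware range are assumptions made for the example:]

```python
def legacy_mb_view(hw_vals, hw_max=255):
    """Summarise diverged per-region MB_HW<n> values as their average,
    and derive the legacy MB percentage from that average."""
    avg = sum(hw_vals) / len(hw_vals)
    mb_hw = round(avg)               # displayed on the "# MB_HW:" line
    mb = round(avg * 100 / hw_max)   # displayed on the legacy "MB:" line
    return mb, mb_hw
```

With the example values above, legacy_mb_view([127, 64, 127, 127])
yields MB = 44 and MB_HW = 111, matching the "average across all
regions" output shown.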
What do you think?
I'm wondering whether setting the MB_HW<n> independently may be quite a
specialised use case, which not everyone will want/need to do, but
that's an assumption on my part.
Cheers
---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-22 14:58 ` Dave Martin
@ 2025-10-22 16:21 ` Luck, Tony
2025-10-23 14:04 ` Dave Martin
0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2025-10-22 16:21 UTC (permalink / raw)
To: Dave Martin
Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Dave,
On Wed, Oct 22, 2025 at 03:58:08PM +0100, Dave Martin wrote:
> Hi Tony,
>
> On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote:
> > Hi Dave,
> >
> > On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote:
> > > Hi Tony,
> > >
> > > On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote:
>
> [...]
>
> > > > Changes to the schemata file are currently "staged" and then applied.
> > > > There's some filesystem level error/sanity checking during the parsing
> > > > phase, but maybe for MB some parts can also be delayed, and re-ordered
> > > > when architecture code applies the changes.
> > > >
> > > > E.g. while filesystem code could check min <= opt <= max. Architecture
> > > > code would be responsible to write the values to h/w in a sane manner
> > > > (assuming architecture cares about transient effects when things don't
> > > > conform to the ordering).
> > > >
> > > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60
> > > > Regardless of the order those requests appeared in the write(2) syscall
> > > > architecture bumps max to 60, then opt to 50, and finally min to 40.
> > >
> > > This could indeed be sorted out during staging, but I'm not
> > > sure that we can/should rely on it.
> > >
> > > If we treat the data coming from a single write() as a transaction, and
> > > stage the whole thing before executing it, that's fine. But I think
> > > this has to be viewed as an optimisation rather than guaranteed
> > > semantics.
> > >
> > >
> > > We told userspace that schemata is an S_IFREG regular file, so we have
> > > to accept a write() boundary anywhere in the stream.
> > >
> > > (In fact, resctrl chokes if a write boundary occurs in the middle of a
> > > line. In practice, stdio buffering and similar means that this issue
> > > turns out to be difficult to hit, except with shell scripts that try to
> > > emit a line piecemeal -- I have a partial fix for that knocking around,
> > > but this throws up other problems, so I gave up for the time being.)
> >
> > Is this worth the pain and complexity? Maybe just document the reality
> > of the implementation since day 1 of resctrl that each write(2) must
> > contain one or more lines, each terminated with "\n".
>
> <soapbox>
>
> We could, in the same way that a vendor could wire a UART directly to
> the pins of a regular mains power plug. They could stick a big label
> on it saying exactly how the pins should be hooked up to another low-
> voltage UART and not plugged into a mains power outlet... but you know
> what's going to happen.
The PDP 11/03 for undergraduate Comp Sci student use at my university had
allegedly been student-proofed against such things. Oral history said you could
wire 240V mains across input pins to get a 50 Hz clock. I didn't test this
theory.
> The whole point of a file-like interface is that the user doesn't (or
> shouldn't) have to craft I/O directly at the syscall level. If they
> have to do that, then the reasons for not relying on ioctl() or a
> binary protocol melt away (like that UART).
>
> Because the easy, unsafe way of working with these files almost always
> works, people are almost certainly going to use it, even if we tell
> them not to (IMHO).
>
> </soapbox>
>
>
> That said, for practical purposes, the interface is reliable enough
> (for now). We probably shouldn't mess with it unless we can come up
> with something that is clearly better.
>
> (I have some ideas, but I think it's off-topic, here.)
Agreed off-topic ... but fixing it seems hard. What if I do:
# echo -n "L3:0=" > schemata
and then my control program dies?
> > There are already so many ways that the schemata file does not behave
> > like a regular S_IFREG file. E.g. accepting a write to just update
> > one domain in a resource: # echo L3:2=0xff > schemata
>
> That still feels basically file-like. I can write something into a
> file, then something else can read what I wrote, interpret it in any
> way it likes, and write back something different for me to read.
>
> In our case, it is as if after each write() the kernel magically reads
> and rewrites the file before userspace gets a chance to do anything
> else. This doesn't work as a protocol between userspace processes, but
> the kernel can pull tricks that are not available to userspace -- so it
> can be made to work for user <-> kernel protocols (modulo the issues
> about write() boundaries etc.)
>
> > So describe schemata in terms of writing "update commands" rather
> > than "Lines"?
>
> That's reasonable. In practice, each line written is a request to the
> kernel to do something, but it's already the case that the kernel
> doesn't necessarily do exactly what was asked for (due to rounding,
> etc.)
>
>
> Overall, I think the current state of play is that we need to consider
> the lines to be independent "commands", and execute them in the order
> given.
>
> That's the model I've been assuming here.
>
>
> > > We also cannot currently rely on userspace closing the fd between
> > > "transactions". We never told userspace to do that, previously. We
> > > could make a new requirement, but it feels unexpected/unreasonable (?)
> > >
> > > > >
> > > > > "So, from now on, only write the things that you actually want to set."
> > > > >
> > > > > Does that sound about right?
> > > >
> > > > Users might still use their favorite editor on the schemata file and
> > > > so write everything, while only changing a subset. So if we don't go
> > > > for the full two-phase update I describe above this would be:
> > > >
> > > > "only *change* the things that you actually want to set".
> >
> > I misremembered where the check for "did the user change the value"
> > happened. I thought it was during parsing, but it is actually in
> > resctrl_arch_update_domains() after all input parsing is complete
> > and resctrl is applying changes. So unless we change things to work
> > the way I hallucinated, then ordering does matter the way you
> > described.
>
> Ah, right.
>
> There would be different ways to do this, but yes, that was my
> understanding of how things work today.
>
> > >
> > > [...]
> > >
> > > > -Tony
> > >
> > > This works if the schemata file is output in the right order (and the
> > > user doesn't change the order):
> > >
> > > # cat schemata
> > > MB:0=100;1=100
> > > # MB_HW:0=1024;1=1024
> > >
> > > ->
> > >
> > > # cat <<EOF >schemata
> > > MB:0=100;1=100
> > > MB_HW:0=512;1=512
> > > EOF
> > >
> > > ... though it may still be inefficient, if the lines are not staged
> > > together. The hardware memory bandwidth controls may get programmed
> > > twice, here -- though the final result is probably what was intended.
> > >
> > > I'd still prefer that we tell people that they should be doing this:
> > > # cat <<EOF >schemata
> > > MB_HW:0=512,1=512
> > > EOF
> > >
> > > ...if they are really trying to set MB_HW and don't care about the
> > > effect on MB?
> >
> > I'm starting to worry about this co-existence of old/new syntax for
> > Intel region aware. Life seems simple if there is only one MB_HW
> > connected to the legacy "MB". Updates to either will make both
> > appear with new values when the schemata is read. E.g.
> >
> > # cat schemata
> > MB:0=100
> > #MB_HW=255
> >
> > # echo MB:0=50 > schemata
> >
> > # cat schemata
> > MB:0=50
> > #MB_HW=127
> >
> > But Intel will have several MB_HW controls, one for each region.
> > [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here]
> >
> > # cat schemata
> > MB:0=100
> > #MB_HW0=255
> > #MB_HW1=255
> > #MB_HW2=255
> > #MB_HW3=255
> >
> > If the user sets just one of the HW controls:
> >
> > # echo MB_HW1=64
> >
> > what should resctrl display for the legacy "MB:" line?
> >
> > -Tony
>
> Erm, good question. I hadn't thought too carefully about the
> region-aware case.
>
> I think it's reasonable to expect software that writes MB_HW<n>
> independently to pay attention only to these specific schemata when
> reading back -- a bit like accessing a C union.
>
> # echo 'MB:0=100' >schemata
> # cat schemata
> ->
> MB:0=100
> # MB_HW:0=255
> # MB_HW0:0=255
> # MB_HW1:0=255
> # MB_HW2:0=255
> # MB_HW3:0=255
>
> # echo 'MB:0=50' >schemata
> # cat schemata
> ->
> MB:0=50
> # MB_HW:0=128
> # MB_HW0:0=128
> # MB_HW1:0=128
> # MB_HW2:0=128
> # MB_HW3:0=128
>
> # echo 'MB_HW:0=127' >schemata
> # cat schemata
> ->
> MB:0=50
> # MB_HW:0=127
> # MB_HW0:0=127
> # MB_HW1:0=127
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> # echo 'MB_HW1:0=64' >schemata
> # cat schemata
> ->
> MB:0=???
> # MB_HW:0=???
> # MB_HW0:0=127
> # MB_HW1:0=64
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> The rules for populating the ??? entries could be designed to be
> somewhat intuitive, or we could just do the easiest thing.
>
> So, could we just pick one, fixed, region to read the MB_HW value from?
> Say, MB_HW0:
>
> MB:0=50
> # MB_HW:0=127
> # MB_HW0:0=127
> # MB_HW1:0=64
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> Or take the average across all regions:
>
> MB:0=44
> # MB_HW:0=111
> # MB_HW0:0=127
> # MB_HW1:0=64
> # MB_HW2:0=127
> # MB_HW3:0=127
>
> The latter may be more costly or complex to implement, and I don't
> know whether it is really useful. Software that knows about the
> MB_HW<n> entries also knows that once you have looked at these, MB_HW
> and MB tell you nothing else.
>
> What do you think?
>
> I'm wondering whether setting the MB_HW<n> independently may be quite a
> specialised use case, which not everyone will want/need to do, but
> that's an assumption on my part.
It's difficult to guess what users will want to do. But it is likely
the case that total available bandwidth to regions will be different
(local DDR > remote DDR > CXL). So while the system will boot up with
no throttling on any region, it may be useful to enforce more throttling
on access to the slower regions.
Rather than trying to make up some number to fill in the ??? for the MB:
line, another option would be to stop showing the legacy MB: line in schemata
as soon as the user shows they know about the direct HW access mode
by writing any of the HW lines.
Any sysadmin trying to mix and match legacy access with direct HW access
is going to run into problems very quickly. In the spirit of not giving
them the cable to connect mains to the UART, perhaps removing the
foot-gun from the table might be a good option?
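For reference, the "average across all regions" option quoted above is cheap to compute; a hypothetical sketch (the 0-255 hardware scale and the per-region values are taken from the examples in this thread, not from any actual implementation):

```python
# Hypothetical sketch of the "average across all regions" fallback for
# the legacy MB: line, using the 0-255 hardware scale from the examples.

def legacy_mb_from_regions(hw_values, hw_max=255):
    """Derive the MB_HW and MB values shown for the legacy lines.

    hw_values: per-region MB_HW control values (e.g. MB_HW0..MB_HW3).
    Returns (averaged hardware value, rounded percentage).
    """
    avg_hw = sum(hw_values) // len(hw_values)          # average across regions
    percent = (avg_hw * 100 + hw_max // 2) // hw_max   # round to a percentage
    return avg_hw, percent
```

With the values from the example above, `legacy_mb_from_regions([127, 64, 127, 127])` gives the `MB_HW:0=111` / `MB:0=44` pair shown earlier.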
> Cheers
> ---Dave
-Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
2025-10-22 16:21 ` Luck, Tony
@ 2025-10-23 14:04 ` Dave Martin
0 siblings, 0 replies; 52+ messages in thread
From: Dave Martin @ 2025-10-23 14:04 UTC (permalink / raw)
To: Luck, Tony
Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Jonathan Corbet, x86, linux-doc
Hi Tony,
On Wed, Oct 22, 2025 at 09:21:03AM -0700, Luck, Tony wrote:
> Hi Dave,
>
> On Wed, Oct 22, 2025 at 03:58:08PM +0100, Dave Martin wrote:
> > Hi Tony,
> >
> > On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote:
[...]
> > <soapbox>
> >
> > We could, in the same way that a vendor could wire a UART directly to
> > the pins of a regular mains power plug. They could stick a big label
> > on it saying exactly how the pins should be hooked up to another low-
> > voltage UART and not plugged into a mains power outlet... but you know
> > what's going to happen.
>
> The PDP 11/03 for undergraduate Comp Sci student use at my university had allegedly
> been student-proofed against such things. Oral history said you could wire 240V
> mains across input pins to get a 50 Hz clock. I didn't test this theory.
Now, there's an idea...
> > The whole point of a file-like interface is that the user doesn't (or
> > shouldn't) have to craft I/O directly at the syscall level. If they
> > have to do that, then the reasons for not relying on ioctl() or a
> > binary protocol melt away (like that UART).
> >
> > Because the easy, unsafe way of working with these files almost always
> > works, people are almost certainly going to use it, even if we tell
> > them not to (IMHO).
> >
> > </soapbox>
> >
> >
> > That said, for practical purposes, the interface is reliable enough
> > (for now). We probably shouldn't mess with it unless we can come up
> > with something that is clearly better.
> >
> > (I have some ideas, but I think it's off-topic, here.)
>
> Agreed off-topic ... but fixing it seems hard. What if I do:
>
> # echo -n "L3:0=" > schemata
>
> and then my control program dies?
Probably nothing?
In my hack for this, I buffered a partial line for each open struct file.
If the struct file survives the terminated program, something else
could append more to the incomplete line through any fd still open on
the struct file (as in my { { echo; ... echo; } >schemata; } shell
example).
Otherwise, when the file is closed with an incomplete line, an error
could be reported through close(). I implemented this, but it turns
out not to be a magic bullet -- lots of software doesn't check the
return value from close() / fclose(), and Linux's version of dup2()
just silently loses close-time errors on the fd being clobbered.
(dash, and probably other shells, undo redirections using dup2().
Dupping the victim fd before the dup2(), so that it can be closed
separately, can help -- as documented in the dup2() man page. But as
of today, most software probably doesn't do this. Some OSes seem to
have different dup2() behaviour that doesn't suffer from this problem.)
Anyway, all in all, I wasn't convinced that this approach created fewer
problems than it solved...
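To make the buffering idea above concrete, here is a toy user-space model of it (a hypothetical sketch, not the actual hack: one buffer per open file, complete lines parsed as they arrive, and close() reporting a leftover partial line):

```python
# Toy model of per-open-file line buffering for a schemata-like file.
# Each open struct file gets its own buffer; writes may end mid-line,
# and only newline-terminated lines are handed to the parser.

class SchemataFile:
    def __init__(self):
        self.partial = ""   # unterminated tail of the most recent write
        self.lines = []     # complete lines handed to the (notional) parser

    def write(self, data):
        self.partial += data
        # Hand off every complete line; keep any unterminated tail.
        while "\n" in self.partial:
            line, self.partial = self.partial.split("\n", 1)
            self.lines.append(line)

    def close(self):
        # Report an error for an incomplete line -- though, as noted
        # above, much software never checks close()'s return value.
        if self.partial:
            raise OSError("incomplete line at close")
```

So `write("L3:0=")` followed by a dying program leaves the tail buffered; a later `write("fff\n")` through another fd on the same struct file completes the line, and otherwise the error only surfaces at close time.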
[...]
> > > I'm starting to worry about this co-existence of old/new syntax for
> > > Intel region aware. Life seems simple if there is only one MB_HW
> > > connected to the legacy "MB". Updates to either will make both
> > > appear with new values when the schemata is read. E.g.
> > >
> > > # cat schemata
> > > MB:0=100
> > > #MB_HW=255
> > >
> > > # echo MB:0=50 > schemata
> > >
> > > # cat schemata
> > > MB:0=50
> > > #MB_HW=127
> > >
> > > But Intel will have several MB_HW controls, one for each region.
> > > [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here]
> > >
> > > # cat schemata
> > > MB:0=100
> > > #MB_HW0=255
> > > #MB_HW1=255
> > > #MB_HW2=255
> > > #MB_HW3=255
> > >
> > > If the user sets just one of the HW controls:
> > >
> > > # echo MB_HW1=64
> > >
> > > what should resctrl display for the legacy "MB:" line?
> > >
> > > -Tony
> >
> > Erm, good question. I hadn't thought too carefully about the
> > region-aware case.
> >
> > I think it's reasonable to expect software that writes MB_HW<n>
> > independently to pay attention only to these specific schemata when
> > reading back -- a bit like accessing a C union.
> >
> > # echo 'MB:0=100' >schemata
> > # cat schemata
> > ->
> > MB:0=100
> > # MB_HW:0=255
> > # MB_HW0:0=255
> > # MB_HW1:0=255
> > # MB_HW2:0=255
> > # MB_HW3:0=255
> >
> > # echo 'MB:0=50' >schemata
> > # cat schemata
> > ->
> > MB:0=50
> > # MB_HW:0=128
> > # MB_HW0:0=128
> > # MB_HW1:0=128
> > # MB_HW2:0=128
> > # MB_HW3:0=128
> >
> > # echo 'MB_HW:0=127' >schemata
> > # cat schemata
> > ->
> > MB:0=50
> > # MB_HW:0=127
> > # MB_HW0:0=127
> > # MB_HW1:0=127
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > # echo 'MB_HW1:0=64' >schemata
> > # cat schemata
> > ->
> > MB:0=???
> > # MB_HW:0=???
> > # MB_HW0:0=127
> > # MB_HW1:0=64
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > The rules for populating the ??? entries could be designed to be
> > somewhat intuitive, or we could just do the easiest thing.
> >
> > So, could we just pick one, fixed, region to read the MB_HW value from?
> > Say, MB_HW0:
> >
> > MB:0=50
> > # MB_HW:0=127
> > # MB_HW0:0=127
> > # MB_HW1:0=64
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > Or take the average across all regions:
> >
> > MB:0=44
> > # MB_HW:0=111
> > # MB_HW0:0=127
> > # MB_HW1:0=64
> > # MB_HW2:0=127
> > # MB_HW3:0=127
> >
> > The latter may be more costly or complex to implement, and I don't
> > know whether it is really useful. Software that knows about the
> > MB_HW<n> entries also knows that once you have looked at these, MB_HW
> > and MB tell you nothing else.
> >
> > What do you think?
> >
> > I'm wondering whether setting the MB_HW<n> independently may be quite a
> > specialised use case, which not everyone will want/need to do, but
> > that's an assumption on my part.
>
> It's difficult to guess what users will want to do. But it is likely
> the case that total available bandwidth to regions will be different
> (local DDR > remote DDR > CXL). So while the system will boot up with
> no throttling on any region, it may be useful to enforce more throttling
> on access to the slower regions.
>
> Rather than trying to make up some number to fill in the ??? for the MB:
> line, another option would be to stop showing the legacy MB: line in schemata
> as soon as the user shows they know about the direct HW access mode
> by writing any of the HW lines.
>
> Any sysadmin trying to mix and match legacy access with direct HW access
> is going to run into problems very quickly. In the spirit of not giving
> them the cable to connect mains to the UART, perhaps removing the
> foot-gun from the table might be a good option?
>
> -Tony
Quite possibly.
Ideally, we'd have some kind of generic interface, but (as with "MB")
there's always the risk that the hardware evolves in directions that
don't fit the abstraction.
For now, I will try to refocus the discussion back onto the schema
description topic. I think that's probably the easiest thing to get
nailed down before we try to figure out how to deal with the "shadow
schema" issue.
Cheers
---Dave
2025-09-02 16:24 [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch Dave Martin
2025-09-12 22:19 ` Reinette Chatre
2025-09-22 14:39 ` Dave Martin
2025-09-23 17:27 ` Reinette Chatre
2025-09-25 12:46 ` Dave Martin
2025-09-25 20:53 ` Reinette Chatre
2025-09-25 21:35 ` Luck, Tony
2025-09-25 22:18 ` Reinette Chatre
2025-09-29 13:08 ` Dave Martin
2025-09-29 12:43 ` Dave Martin
2025-09-29 15:38 ` Reinette Chatre
2025-09-29 16:10 ` Dave Martin
2025-10-15 15:18 ` Dave Martin
2025-10-16 15:57 ` Reinette Chatre
2025-10-17 15:52 ` Dave Martin
2025-09-22 15:04 ` Dave Martin
2025-09-25 22:58 ` Luck, Tony
2025-09-29 9:19 ` Chen, Yu C
2025-09-29 14:13 ` Dave Martin
2025-09-29 16:23 ` Luck, Tony
2025-09-30 11:02 ` Chen, Yu C
2025-09-30 16:08 ` Luck, Tony
2025-09-30 4:43 ` Chen, Yu C
2025-09-30 15:55 ` Dave Martin
2025-10-01 12:13 ` Chen, Yu C
2025-10-02 15:40 ` Dave Martin
2025-10-02 16:43 ` Luck, Tony
2025-09-29 13:56 ` Dave Martin
2025-09-29 16:09 ` Reinette Chatre
2025-09-30 15:40 ` Dave Martin
2025-10-10 16:48 ` Reinette Chatre
2025-10-11 17:15 ` Chen, Yu C
2025-10-13 15:01 ` Dave Martin
2025-10-13 14:36 ` Dave Martin
2025-10-14 22:55 ` Reinette Chatre
2025-10-15 15:47 ` Dave Martin
2025-10-15 18:48 ` Luck, Tony
2025-10-16 14:50 ` Dave Martin
2025-10-16 16:31 ` Reinette Chatre
2025-10-17 14:17 ` Dave Martin
2025-10-17 15:59 ` Reinette Chatre
2025-10-20 15:50 ` Dave Martin
2025-10-20 16:31 ` Luck, Tony
2025-10-21 14:37 ` Dave Martin
2025-10-21 20:59 ` Luck, Tony
2025-10-22 14:58 ` Dave Martin
2025-10-22 16:21 ` Luck, Tony
2025-10-23 14:04 ` Dave Martin
2025-09-29 16:37 ` Luck, Tony
2025-09-30 16:02 ` Dave Martin
2025-09-26 20:54 ` Reinette Chatre
2025-09-29 13:40 ` Dave Martin