* [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
@ 2025-09-02 16:24 Dave Martin
  2025-09-12 22:19 ` Reinette Chatre
  2025-09-22 15:04 ` Dave Martin
  0 siblings, 2 replies; 52+ messages in thread

From: Dave Martin @ 2025-09-02 16:24 UTC (permalink / raw)
To: linux-kernel
Cc: Tony Luck, Reinette Chatre, James Morse, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
    Jonathan Corbet, x86, linux-doc

The control value parser for the MB resource currently coerces the
memory bandwidth percentage value from userspace to be an exact
multiple of the bw_gran parameter.

On MPAM systems, this results in somewhat worse-than-worst-case
rounding, since bw_gran is in general only an approximation to the
actual hardware granularity on these systems, and the hardware
bandwidth allocation control value is not natively a percentage --
necessitating a further conversion in the resctrl_arch_update_domains()
path, regardless of the conversion done at parse time.

Allow the arch to provide its own parse-time conversion that is
appropriate for the hardware, and move the existing conversion to x86.
This will avoid accumulated error from rounding the value twice on MPAM
systems.

Clarify the documentation, but avoid overly exact promises.

Clamping to bw_min and bw_max still feels generic: leave it in the core
code, for now.

No functional change.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>

---

Based on v6.17-rc3.

Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
the other tests except for the NONCONT_CAT tests, which do not seem to
be supported in my configuration -- and have nothing to do with the
code touched by this patch).

Notes:

I put the x86 version out of line in order to avoid having to move
struct rdt_resource and its dependencies into resctrl_types.h -- which
would create a lot of diff noise.  Schemata writes from userspace have
a high overhead in any case.

For MPAM the conversion will be a no-op, because the incoming
percentage from the core resctrl code needs to be converted to hardware
representation in the driver anyway.

Perhaps _all_ the types should move to resctrl_types.h.

For now, I went for the smallest diffstat...

---
 Documentation/filesystems/resctrl.rst     | 7 +++----
 arch/x86/include/asm/resctrl.h            | 2 ++
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++
 fs/resctrl/ctrlmondata.c                  | 2 +-
 include/linux/resctrl.h                   | 6 ++++++
 5 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..a1d0469d6dfb 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -143,12 +143,11 @@ with respect to allocation:
 			user can request.
 
 "bandwidth_gran":
-			The granularity in which the memory bandwidth
+			The approximate granularity in which the memory bandwidth
 			percentage is allocated. The allocated
 			b/w percentage is rounded off to the next
-			control step available on the hardware. The
-			available bandwidth control steps are:
-			min_bandwidth + N * bandwidth_gran.
+			control step available on the hardware. The available
+			steps are at least as small as this value.
 
 "delay_linear":
 			Indicates if the delay scale is linear or
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index feb93b50e990..8bec2b9cc503 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -18,6 +18,8 @@
  */
 #define X86_RESCTRL_EMPTY_CLOSID	((u32)~0)
 
+struct rdt_resource;
+
 /**
  * struct resctrl_pqr_state - State cache for the PQR MSR
  * @cur_rmid:		The cached Resource Monitoring ID
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 1189c0df4ad7..cf9b30b5df3c 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -16,9 +16,15 @@
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
 #include <linux/cpu.h>
+#include <linux/math.h>
 
 #include "internal.h"
 
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
+{
+	return roundup(val, (unsigned long)r->membw.bw_gran);
+}
+
 int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index d98e0d2de09f..c5e73b75aaa0 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
		return false;
	}
 
-	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
+	*data = resctrl_arch_round_bw(bw, r);
	return true;
 }
 
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 6fb4894b8cfd..5b2a555cf2dd 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
 bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
 int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
 
+/*
+ * Round a bandwidth control value to the nearest value acceptable to
+ * the arch code for resource r:
+ */
+u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
+
 /*
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.

base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
-- 
2.34.1
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
  2025-09-02 16:24 [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch Dave Martin
@ 2025-09-12 22:19 ` Reinette Chatre
  2025-09-22 14:39   ` Dave Martin
  2025-09-22 15:04 ` Dave Martin
  1 sibling, 1 reply; 52+ messages in thread

From: Reinette Chatre @ 2025-09-12 22:19 UTC (permalink / raw)
To: Dave Martin, linux-kernel
Cc: Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
    x86, linux-doc

Hi Dave,

nits:
Please use the subject prefix "x86,fs/resctrl" to be consistent with other
resctrl code (and was established by Arm :)).
Also please use upper case for acronym mba->MBA.

On 9/2/25 9:24 AM, Dave Martin wrote:
> The control value parser for the MB resource currently coerces the
> memory bandwidth percentage value from userspace to be an exact
> multiple of the bw_gran parameter.

(to help be specific)
"the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?

> 
> On MPAM systems, this results in somewhat worse-than-worst-case
> rounding, since bw_gran is in general only an approximation to the
> actual hardware granularity on these systems, and the hardware
> bandwidth allocation control value is not natively a percentage --
> necessitating a further conversion in the resctrl_arch_update_domains()
> path, regardless of the conversion done at parse time.
> 
> Allow the arch to provide its own parse-time conversion that is
> appropriate for the hardware, and move the existing conversion to x86.
> This will avoid accumulated error from rounding the value twice on MPAM
> systems.
> 
> Clarify the documentation, but avoid overly exact promises.
> 
> Clamping to bw_min and bw_max still feels generic: leave it in the core
> code, for now.

Sounds like MPAM may be ready to start the schema parsing discussion again?
I understand that MPAM has a few more ways to describe memory bandwidth as
well as cache portion partitioning. Previously ([1] [2]) James mused about
exposing schema format to user space, which seems like a good idea for new
schema. Is this something MPAM is still considering? For example, the
minimum and maximum ranges that can be specified, is this something you
already have some ideas for? Have you perhaps considered Tony's RFD [3]
that includes discussion on how to handle min/max ranges for bandwidth?

> 
> No functional change.
> 
> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> 
> ---
> 
> Based on v6.17-rc3.
> 
> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> the other tests except for the NONCONT_CAT tests, which do not seem to
> be supported in my configuration -- and have nothing to do with the
> code touched by this patch).

Is the NONCONT_CAT test failing (i.e printing "not ok")?

The NONCONT_CAT tests may print error messages as debug information as part
of running, but these errors are expected as part of the test. The test
should accurately state whether it passed or failed though. For example,
below attempts to write a non-contiguous CBM to a system that does not
support non-contiguous masks. This fails as expected, error messages printed
as debugging and thus the test passes with an "ok".

  # Write schema "L3:0=ff0ff" to resctrl FS
  # write() failed : Invalid argument
  # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
  ok 5 L3_NONCONT_CAT: test

> 
> Notes:
> 
> I put the x86 version out of line in order to avoid having to move
> struct rdt_resource and its dependencies into resctrl_types.h -- which
> would create a lot of diff noise. Schemata writes from userspace have
> a high overhead in any case.

Sounds good, I expect compiler will inline.

> 
> For MPAM the conversion will be a no-op, because the incoming
> percentage from the core resctrl code needs to be converted to hardware
> representation in the driver anyway.

(addressed below)

> 
> Perhaps _all_ the types should move to resctrl_types.h.

Can surely consider when there is a good motivation.

> 
> For now, I went for the smallest diffstat...
> 
> ---
>  Documentation/filesystems/resctrl.rst     | 7 +++----
>  arch/x86/include/asm/resctrl.h            | 2 ++
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 ++++++
>  fs/resctrl/ctrlmondata.c                  | 2 +-
>  include/linux/resctrl.h                   | 6 ++++++
>  5 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index c7949dd44f2f..a1d0469d6dfb 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -143,12 +143,11 @@ with respect to allocation:
>  			user can request.
>  
>  "bandwidth_gran":
> -			The granularity in which the memory bandwidth
> +			The approximate granularity in which the memory bandwidth
>  			percentage is allocated. The allocated
>  			b/w percentage is rounded off to the next
> -			control step available on the hardware. The
> -			available bandwidth control steps are:
> -			min_bandwidth + N * bandwidth_gran.
> +			control step available on the hardware. The available
> +			steps are at least as small as this value.

A bit difficult to parse for me.
Is "at least as small as" same as "at least"?

Please note that the documentation has a section "Memory bandwidth Allocation
and monitoring" that also contains these exact promises.

> 
>  "delay_linear":
>  			Indicates if the delay scale is linear or
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index feb93b50e990..8bec2b9cc503 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -18,6 +18,8 @@
>   */
>  #define X86_RESCTRL_EMPTY_CLOSID	((u32)~0)
>  
> +struct rdt_resource;
> +

I'm missing something here. Why is this needed?

>  /**
>   * struct resctrl_pqr_state - State cache for the PQR MSR
>   * @cur_rmid:		The cached Resource Monitoring ID
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1189c0df4ad7..cf9b30b5df3c 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -16,9 +16,15 @@
>  #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
>  
>  #include <linux/cpu.h>
> +#include <linux/math.h>
>  
>  #include "internal.h"
>  
> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r)
> +{
> +	return roundup(val, (unsigned long)r->membw.bw_gran);
> +}
> +
>  int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>  			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
>  {
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index d98e0d2de09f..c5e73b75aaa0 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
>  		return false;
>  	}
>  
> -	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
> +	*data = resctrl_arch_round_bw(bw, r);

Please check that function comments remain accurate after changes (specifically
if making the conversion more generic as proposed below).

>  	return true;
>  }
>  
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 6fb4894b8cfd..5b2a555cf2dd 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
>  bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
>  int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>  
> +/*
> + * Round a bandwidth control value to the nearest value acceptable to
> + * the arch code for resource r:
> + */
> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> +

I do not think that resctrl should make any assumptions on what the
architecture's conversion does (i.e "round"). That architecture needs to be
asked to "round a bandwidth control value" also sounds strange since resctrl
really should be able to do something like rounding itself. As I understand
from the notes this will be a no-op for MPAM making this even more confusing.

How about naming the helper something like resctrl_arch_convert_bw()?
(Open to other ideas of course).

If you make such a change, please check that subject of patch still fits.

I think that using const to pass data to architecture is great, thanks.

Reinette

[1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/
[2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/
[3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
  2025-09-12 22:19 ` Reinette Chatre
@ 2025-09-22 14:39   ` Dave Martin
  2025-09-23 17:27     ` Reinette Chatre
  2025-10-15 15:18     ` Dave Martin
  0 siblings, 2 replies; 52+ messages in thread

From: Dave Martin @ 2025-09-22 14:39 UTC (permalink / raw)
To: Reinette Chatre
Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
    Jonathan Corbet, x86, linux-doc

Hi Reinette,

Thanks for the review.

On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote:
> Hi Dave,
> 
> nits:
> Please use the subject prefix "x86,fs/resctrl" to be consistent with other
> resctrl code (and was established by Arm :)).
> Also please use upper case for acronym mba->MBA.

Ack (the local custom in the MPAM code is to use "mba", but arguably,
the meaning is not quite the same -- I'll change it.)

> On 9/2/25 9:24 AM, Dave Martin wrote:
> > The control value parser for the MB resource currently coerces the
> > memory bandwidth percentage value from userspace to be an exact
> > multiple of the bw_gran parameter.
> 
> (to help be specific)
> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"?

"bw_gran" was intended as an informal shorthand for the abstract
parameter (exposed both in the field you mention and through the
bandwidth_gran file in resctrl).

I can rewrite it as per your suggestion, but this could be read as
excluding the bandwidth_gran file.  Would it make sense just to write
it out longhand?  For now, I've rewritten it as follows:

| The control value parser for the MB resource currently coerces the
| memory bandwidth percentage value from userspace to be an exact
| multiple of the bandwidth granularity parameter.
|
| On MPAM systems, this results in somewhat worse-than-worst-case
| rounding, since the bandwidth granularity advertised to resctrl by the
| MPAM driver is in general only an approximation [...]

(I'm happy to go with your suggestion if you're not keen on this,
though.)

> > On MPAM systems, this results in somewhat worse-than-worst-case
> > rounding, since bw_gran is in general only an approximation to the
> > actual hardware granularity on these systems, and the hardware
> > bandwidth allocation control value is not natively a percentage --
> > necessitating a further conversion in the resctrl_arch_update_domains()
> > path, regardless of the conversion done at parse time.
> >
> > Allow the arch to provide its own parse-time conversion that is
> > appropriate for the hardware, and move the existing conversion to x86.
> > This will avoid accumulated error from rounding the value twice on MPAM
> > systems.
> >
> > Clarify the documentation, but avoid overly exact promises.
> >
> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
> 
> Sounds like MPAM may be ready to start the schema parsing discussion again?
> I understand that MPAM has a few more ways to describe memory bandwidth as
> well as cache portion partitioning. Previously ([1] [2]) James mused about
> exposing schema format to user space, which seems like a good idea for new
> schema.

My own ideas in this area are a little different, though I agree with
the general idea.

I'll respond separately on that, to avoid this thread getting off-topic.

For this patch, my aim was to avoid changing anything unnecessarily.

> Is this something MPAM is still considering? For example, the minimum
> and maximum ranges that can be specified, is this something you already
> have some ideas for? Have you perhaps considered Tony's RFD [3] that
> includes discussion on how to handle min/max ranges for bandwidth?

This is another thing that we probably do want to support at some
point, but it feels like a different thing from the minimum and maximum
bounds acceptable to an individual schema -- especially since in the
hardware they may behave more like trigger points than hard limits.

Again, I'll respond separately.

[...]

> > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+
> > the other tests except for the NONCONT_CAT tests, which do not seem to
> > be supported in my configuration -- and have nothing to do with the
> > code touched by this patch).
> 
> Is the NONCONT_CAT test failing (i.e printing "not ok")?
> 
> The NONCONT_CAT tests may print error messages as debug information as
> part of running, but these errors are expected as part of the test. The
> test should accurately state whether it passed or failed though. For
> example, below attempts to write a non-contiguous CBM to a system that
> does not support non-contiguous masks. This fails as expected, error
> messages printed as debugging and thus the test passes with an "ok".
> 
>   # Write schema "L3:0=ff0ff" to resctrl FS
>   # write() failed : Invalid argument
>   # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
>   ok 5 L3_NONCONT_CAT: test

I don't think that this was anything to do with my changes, but I no
longer seem to have the test output.  (Since this test has to do with
bitmap schemata (?), it seemed unlikely to be affected by changes to
bw_validate().)

I'll need to re-test with and without this patch to check whether it
makes any difference.

> > Notes:
> >
> > I put the x86 version out of line in order to avoid having to move
> > struct rdt_resource and its dependencies into resctrl_types.h -- which
> > would create a lot of diff noise. Schemata writes from userspace have
> > a high overhead in any case.
> 
> Sounds good, I expect compiler will inline.

The function and caller are in separate translation units, so unless
LTO is used, I don't think the function will be inlined.

> > For MPAM the conversion will be a no-op, because the incoming
> > percentage from the core resctrl code needs to be converted to hardware
> > representation in the driver anyway.
> 
> (addressed below)
> 
> > Perhaps _all_ the types should move to resctrl_types.h.
> 
> Can surely consider when there is a good motivation.
> 
> > For now, I went for the smallest diffstat...

I'll assume the motivation is not strong enough for now, but shout if
you disagree.

[...]

> > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> > index c7949dd44f2f..a1d0469d6dfb 100644
> > --- a/Documentation/filesystems/resctrl.rst
> > +++ b/Documentation/filesystems/resctrl.rst
> > @@ -143,12 +143,11 @@ with respect to allocation:
> >  			user can request.
> >  
> >  "bandwidth_gran":
> > -			The granularity in which the memory bandwidth
> > +			The approximate granularity in which the memory bandwidth
> >  			percentage is allocated. The allocated
> >  			b/w percentage is rounded off to the next
> > -			control step available on the hardware. The
> > -			available bandwidth control steps are:
> > -			min_bandwidth + N * bandwidth_gran.
> > +			control step available on the hardware. The available
> > +			steps are at least as small as this value.
> 
> A bit difficult to parse for me.
> Is "at least as small as" same as "at least"?

It was supposed to mean: "The available steps are no larger than this
value."

Formally, my expectation is that this value is the smallest integer
number of percent which is not smaller than the apparent size of any
individual rounding step.  Equivalently, this is the smallest number g
for which writing "MB: 0=x" and "MB: 0=y" yield different
configurations for every in-range x and where y = x + g and y is also
in-range.

That's a bit of a mouthful, though.  If you can think of a more
succinct way of putting it, I'm open to suggestions!

> Please note that the documentation has a section "Memory bandwidth Allocation
> and monitoring" that also contains these exact promises.

Hmmm, somehow I completely missed that.  Does the following make sense?
Ideally, there would be a simpler way to describe the discrepancy
between the reported and actual values of bw_gran...

| Memory bandwidth Allocation and monitoring
| ==========================================
|
| [...]
|
| The minimum bandwidth percentage value for each cpu model is predefined
| and can be looked up through "info/MB/min_bandwidth". The bandwidth
| granularity that is allocated is also dependent on the cpu model and can
| be looked up at "info/MB/bandwidth_gran". The available bandwidth
| -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
| -to the next control step available on the hardware.
| +control steps are: min_bw + N * (bw_gran - e), where e is a
| +non-negative, hardware-defined real constant that is less than 1.
| +Intermediate values are rounded to the next control step available on
| +the hardware.
| +
| +At the time of writing, the constant e referred to in the preceding
| +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran
| +describes the step size exactly), but this may not be the case on other
| +hardware when the actual granularity is not an exact divisor of 100.

> > "delay_linear":
> >  			Indicates if the delay scale is linear or
> > diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> > index feb93b50e990..8bec2b9cc503 100644
> > --- a/arch/x86/include/asm/resctrl.h
> > +++ b/arch/x86/include/asm/resctrl.h
> > @@ -18,6 +18,8 @@
> >   */
> >  #define X86_RESCTRL_EMPTY_CLOSID	((u32)~0)
> >  
> > +struct rdt_resource;
> > +
> 
> I'm missing something here. Why is this needed?

Oops, it's not.  This got left behind from when I had the function
in-line here.  Removed.

[...]
> > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > index d98e0d2de09f..c5e73b75aaa0 100644
> > --- a/fs/resctrl/ctrlmondata.c
> > +++ b/fs/resctrl/ctrlmondata.c
> > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> >  		return false;
> >  	}
> >  
> > -	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
> > +	*data = resctrl_arch_round_bw(bw, r);
> 
> Please check that function comments remain accurate after changes
> (specifically if making the conversion more generic as proposed below).

I hoped that the comment for this function was still applicable, though
it can probably be improved.  How about the following?

| - * hardware. The allocated bandwidth percentage is rounded to the next
| - * control step available on the hardware.
| + * hardware. The allocated bandwidth percentage is converted as
| + * appropriate for consumption by the specific hardware driver.

[...]

> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 6fb4894b8cfd..5b2a555cf2dd 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid,
> >  bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> >  int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> >  
> > +/*
> > + * Round a bandwidth control value to the nearest value acceptable to
> > + * the arch code for resource r:
> > + */
> > +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r);
> > +
> 
> I do not think that resctrl should make any assumptions on what the
> architecture's conversion does (i.e "round"). That architecture needs to be
> asked to "round a bandwidth control value" also sounds strange since resctrl
> really should be able to do something like rounding itself. As I understand
> from the notes this will be a no-op for MPAM making this even more
> confusing.
> 
> How about naming the helper something like resctrl_arch_convert_bw()?
> (Open to other ideas of course).
> 
> If you make such a change, please check that subject of patch still fits.

I struggled a bit with the name.  Really, this is converting the value
to an intermediate form (which might or might not involve rounding).
For historical reasons, this is a value suitable for writing directly
to the relevant x86 MSR without any further interpretation.  For MPAM,
it is convenient to do this conversion later, rather than during
parsing of the value.

Would a name like resctrl_arch_preconvert_bw() be acceptable?  This
isn't more informative than your suggestion regarding what the
conversion is expected to do, but may convey the expectation that the
output value may still not be in its final (i.e., hardware) form.

> I think that using const to pass data to architecture is great, thanks.
> 
> Reinette

I try to constify by default when straightforward to do so, since the
compiler can then find which cases need to change; the reverse
direction is harder to automate...

Cheers
---Dave

[...]

> [1] https://lore.kernel.org/lkml/fa93564a-45b0-ccdd-c139-ae4867eacfb5@arm.com/
> [2] https://lore.kernel.org/all/acefb432-6388-44ed-b444-1e52335c6c3d@arm.com/
> [3] https://lore.kernel.org/lkml/Z_mB-gmQe_LR4FWP@agluck-desk3/
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-22 14:39 ` Dave Martin @ 2025-09-23 17:27 ` Reinette Chatre 2025-09-25 12:46 ` Dave Martin 2025-10-15 15:18 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-09-23 17:27 UTC (permalink / raw) To: Dave Martin Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 9/22/25 7:39 AM, Dave Martin wrote: > Hi Reinette, > > Thanks for the review. > > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >> Hi Dave, >> >> nits: >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other >> resctrl code (and was established by Arm :)). >> Also please use upper case for acronym mba->MBA. > > Ack (the local custom in the MPAM code is to use "mba", but arguably, > the meaning is not quite the same -- I'll change it.) I am curious what the motivation is for the custom? Knowing this will help me to keep things consistent when the two worlds meet. > >> On 9/2/25 9:24 AM, Dave Martin wrote: >>> The control value parser for the MB resource currently coerces the >>> memory bandwidth percentage value from userspace to be an exact >>> multiple of the bw_gran parameter. >> >> (to help be specific) >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > "bw_gran" was intended as an informal shorthand for the abstract > parameter (exposed both in the field you mention and through the > bandiwidth_gran file in resctrl). I do not see a need for being abstract since the bandwidth_gran file exposes the field verbatim. > > I can rewrite it as per your suggestion, but this could be read as > excluding the bandwidth_gran file. Would it make sense just to write > it out longhand? 
For now, I've rewritten it as follows: Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran it is not clear to me how being specific excludes the bandwidth_gran file. > > | The control value parser for the MB resource currently coerces the > | memory bandwidth percentage value from userspace to be an exact > | multiple of the bandwidth granularity parameter. If you want to include the bandwidth_gran file then the above could be something like: The control value parser for the MB resource coerces the memory bandwidth percentage value from userspace to be an exact multiple of the bandwidth granularity parameter that is exposed by the bandwidth_gran resctrl file. I still think that replacing "the bandwidth granularity parameter" with "rdt_resource::resctrl_membw::bw_gran" will help to be more specific. > | > | On MPAM systems, this results in somewhat worse-than-worst-case > | rounding, since the bandwidth granularity advertised to resctrl by the > | MPAM driver is in general only an approximation [...] > > (I'm happy to go with your suggestion if you're not keen on this, > though.) > >>> On MPAM systems, this results in somewhat worse-than-worst-case >>> rounding, since bw_gran is in general only an approximation to the >>> actual hardware granularity on these systems, and the hardware >>> bandwidth allocation control value is not natively a percentage -- >>> necessitating a further conversion in the resctrl_arch_update_domains() >>> path, regardless of the conversion done at parse time. >>> >>> Allow the arch to provide its own parse-time conversion that is >>> appropriate for the hardware, and move the existing conversion to x86. >>> This will avoid accumulated error from rounding the value twice on MPAM >>> systems. >>> >>> Clarify the documentation, but avoid overly exact promises. >>> >>> Clamping to bw_min and bw_max still feels generic: leave it in the core >>> code, for now. 
>> >> Sounds like MPAM may be ready to start the schema parsing discussion again? >> I understand that MPAM has a few more ways to describe memory bandwidth as >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing >> schema format to user space, which seems like a good idea for new schema. > > My own ideas in this area are a little different, though I agree with > the general idea. Should we expect a separate proposal from James? > > I'll respond separately on that, to avoid this thread getting off-topic. Much appreciated. > > For this patch, my aim was to avoid changing anything unnecessarily. Understood. More below as I try to understand the details but it does not really sound as though the current interface works that great for MPAM. If I understand correctly this patch enables MPAM to use existing interface for its memory bandwidth allocations but doing so does not enable users to obtain benefit of hardware capabilities. For that users would want to use the new interface? >>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ >>> the other tests except for the NONCONT_CAT tests, which do not seem to >>> be supported in my configuration -- and have nothing to do with the >>> code touched by this patch). >> >> Is the NONCONT_CAT test failing (i.e printing "not ok")? >> >> The NONCONT_CAT tests may print error messages as debug information as part of >> running, but these errors are expected as part of the test. The test should accurately >> state whether it passed or failed though. For example, below attempts to write >> a non-contiguous CBM to a system that does not support non-contiguous masks. >> This fails as expected, error messages printed as debugging and thus the test passes >> with an "ok". 
>> >> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument >> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected >> ok 5 L3_NONCONT_CAT: test > > I don't think that this was anything to do with my changes, but I don't > still seem to have the test output. (Since this test has to do with > bitmap schemata (?), it seemed unlikely to be affected by changes to > bw_validate().) I agree that this should not have anything to do with this patch. My concern is that I understood that the test failed for a feature that is not supported. If this is the case then there may be a problem with the test. The test should not fail if the feature is not supported but instead skip the test. > >>> Notes: >>> >>> I put the x86 version out of line in order to avoid having to move >>> struct rdt_resource and its dependencies into resctrl_types.h -- which >>> would create a lot of diff noise. Schemata writes from userspace have >>> a high overhead in any case. >> >> Sounds good, I expect compiler will inline. > > The function and caller are in separate translation units, so unless > LTO is used, I don't think the function will be inlined. Thanks, yes, indeed. > >>> >>> For MPAM the conversion will be a no-op, because the incoming >>> percentage from the core resctrl code needs to be converted to hardware >>> representation in the driver anyway. >> >> (addressed below) >> >>> >>> Perhaps _all_ the types should move to resctrl_types.h. >> >> Can surely consider when there is a good motivation. >> >>> >>> For now, I went for the smallest diffstat... > > I'll assume the motivation is not strong enough for now, but shout if > you disagree. I agree. > > [...] 
> >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst >>> index c7949dd44f2f..a1d0469d6dfb 100644 >>> --- a/Documentation/filesystems/resctrl.rst >>> +++ b/Documentation/filesystems/resctrl.rst >>> @@ -143,12 +143,11 @@ with respect to allocation: >>> user can request. >>> >>> "bandwidth_gran": >>> - The granularity in which the memory bandwidth >>> + The approximate granularity in which the memory bandwidth >>> percentage is allocated. The allocated >>> b/w percentage is rounded off to the next >>> - control step available on the hardware. The >>> - available bandwidth control steps are: >>> - min_bandwidth + N * bandwidth_gran. >>> + control step available on the hardware. The available >>> + steps are at least as small as this value. >> >> A bit difficult to parse for me. >> Is "at least as small as" same as "at least"? > > It was supposed to mean: "The available steps are no larger than this > value." This is clear to me, especially when compared with the planned addition to "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting the paragraph below (more below). > > Formally, my expectation is that this value is the smallest integer > number of percent which is not smaller than the apparent size of any > individual rounding step. Equivalently, this is the smallest number g Considering the two statements: - "The available steps are no larger than this value." - "this value ... is not smaller than the apparent size of any individual rounding step" The "not larger" and "not smaller" sounds like all these words just end up saying that this is the step size? > for which writing "MB: 0=x" and "MB: 0=y" yield different > configurations for every in-range x and where y = x + g and y is also > in-range. > > That's a bit of a mouthful, though. If you can think of a more > succinct way of putting it, I'm open to suggestions! 
> >> Please note that the documentation has a section "Memory bandwidth Allocation >> and monitoring" that also contains these exact promises. > > Hmmm, somehow I completely missed that. > > Does the following make sense? Ideally, there would be a simpler way > to describe the discrepancy between the reported and actual values of > bw_gran... > > | Memory bandwidth Allocation and monitoring > | ========================================== > | > | [...] > | > | The minimum bandwidth percentage value for each cpu model is predefined > | and can be looked up through "info/MB/min_bandwidth". The bandwidth > | granularity that is allocated is also dependent on the cpu model and can > | be looked up at "info/MB/bandwidth_gran". The available bandwidth > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded > | -to the next control step available on the hardware. > | +control steps are: min_bw + N * (bw_gran - e), where e is a > | +non-negative, hardware-defined real constant that is less than 1. > | +Intermediate values are rounded to the next control step available on > | +the hardware. > | + > | +At the time of writing, the constant e referred to in the preceding > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran > | +describes the step size exactly), but this may not be the case on other > | +hardware when the actual granularity is not an exact divisor of 100. Have you considered how to share the value of "e" with users? 
>>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c >>> index d98e0d2de09f..c5e73b75aaa0 100644 >>> --- a/fs/resctrl/ctrlmondata.c >>> +++ b/fs/resctrl/ctrlmondata.c >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) >>> return false; >>> } >>> >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran); >>> + *data = resctrl_arch_round_bw(bw, r); >> >> Please check that function comments remain accurate after changes (specifically >> if making the conversion more generic as proposed below). > > I hoped that the comment for this function was still applicable, though > it can probably be improved. How about the following? > > | - * hardware. The allocated bandwidth percentage is rounded to the next > | - * control step available on the hardware. > | + * hardware. The allocated bandwidth percentage is converted as > | + * appropriate for consumption by the specific hardware driver. > > [...] Looks good to me. > >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h >>> index 6fb4894b8cfd..5b2a555cf2dd 100644 >>> --- a/include/linux/resctrl.h >>> +++ b/include/linux/resctrl.h >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); >>> >>> +/* >>> + * Round a bandwidth control value to the nearest value acceptable to >>> + * the arch code for resource r: >>> + */ >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); >>> + >> >> I do not think that resctrl should make any assumptions on what the >> architecture's conversion does (i.e "round"). That architecture needs to be >> asked to "round a bandwidth control value" also sounds strange since resctrl really >> should be able to do something like rounding itself. As I understand from >> the notes this will be a no-op for MPAM making this even more confusing. 
>> >> How about naming the helper something like resctrl_arch_convert_bw()? >> (Open to other ideas of course). >> >> If you make such a change, please check that subject of patch still fits. > > I struggled a bit with the name. Really, this is converting the value > to an intermediate form (which might or might not involve rounding). > For historical reasons, this is a value suitable for writing directly > to the relevant x86 MSR without any further interpretation. > > For MPAM, it is convenient to do this conversion later, rather than > during parsing of the value. > > > Would a name like resctrl_arch_preconvert_bw() be acceptable? Yes. > > This isn't more informative than your suggestion regarding what the > conversion is expected to do, but may convey the expectation that the > output value may still not be in its final (i.e., hardware) form. Sounds good, yes. > >> I think that using const to pass data to architecture is great, thanks. >> >> Reinette > > I try to constify by default when straightforward to do so, since the > compiler can then find which cases need to change; the reverse > direction is harder to automate... Could you please elaborate what you mean with "reverse direction"? Thank you Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-23 17:27 ` Reinette Chatre @ 2025-09-25 12:46 ` Dave Martin 2025-09-25 20:53 ` Reinette Chatre 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-09-25 12:46 UTC (permalink / raw) To: Reinette Chatre Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/22/25 7:39 AM, Dave Martin wrote: > > Hi Reinette, > > > > Thanks for the review. > > > > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > >> Hi Dave, > >> > >> nits: > >> Please use the subject prefix "x86,fs/resctrl" to be consistent with other > >> resctrl code (and was established by Arm :)). > >> Also please use upper case for acronym mba->MBA. > > > > Ack (the local custom in the MPAM code is to use "mba", but arguably, > > the meaning is not quite the same -- I'll change it.) > > I am curious what the motivation is for the custom? Knowing this will help > me to keep things consistent when the two worlds meet. I think this has just evolved over time. On the x86 side, MBA is a specific architectural feature, but on the MPAM side the architecture doesn't really have a name for the same thing. Memory bandwidth is a concept, but a few different types of control are defined for it, with different names. So, for the MPAM driver "mba" is more of a software concept than something in a published spec: it's the glue that attaches to "MB" resource as seen through resctrl. (This isn't official though; it's just the mental model that I have formed.) > > >>> The control value parser for the MB resource currently coerces the > >>> memory bandwidth percentage value from userspace to be an exact > >>> multiple of the bw_gran parameter. 
> >> > >> (to help be specific) > >> "the bw_gran parameter" -> "rdt_resource::resctrl_membw::bw_gran"? > > > > "bw_gran" was intended as an informal shorthand for the abstract > > parameter (exposed both in the field you mention and through the > > bandwidth_gran file in resctrl). > > I do not see a need for being abstract since the bandwidth_gran file exposes > the field verbatim. Sure; that was just my thought process. > > I can rewrite it as per your suggestion, but this could be read as > > excluding the bandwidth_gran file. Would it make sense just to write > > it out longhand? For now, I've rewritten it as follows: > > Since the bandwidth_gran file exposes rdt_resource::resctrl_membw::bw_gran > it is not clear to me how being specific excludes the bandwidth_gran file. > > > > > | The control value parser for the MB resource currently coerces the > > | memory bandwidth percentage value from userspace to be an exact > > | multiple of the bandwidth granularity parameter. > > If you want to include the bandwidth_gran file then the above could be > something like: > > The control value parser for the MB resource coerces the memory > bandwidth percentage value from userspace to be an exact multiple > of the bandwidth granularity parameter that is exposed by the > bandwidth_gran resctrl file. > > I still think that replacing "the bandwidth granularity parameter" with > "rdt_resource::resctrl_membw::bw_gran" will help to be more specific. That's fine. I'll change as per your original suggestion. > > | > > | On MPAM systems, this results in somewhat worse-than-worst-case > > | rounding, since the bandwidth granularity advertised to resctrl by the > > | MPAM driver is in general only an approximation [...] > > > > (I'm happy to go with your suggestion if you're not keen on this, > > though.) 
> > > >>> On MPAM systems, this results in somewhat worse-than-worst-case > >>> rounding, since bw_gran is in general only an approximation to the > >>> actual hardware granularity on these systems, and the hardware > >>> bandwidth allocation control value is not natively a percentage -- > >>> necessitating a further conversion in the resctrl_arch_update_domains() > >>> path, regardless of the conversion done at parse time. > >>> > >>> Allow the arch to provide its own parse-time conversion that is > >>> appropriate for the hardware, and move the existing conversion to x86. > >>> This will avoid accumulated error from rounding the value twice on MPAM > >>> systems. > >>> > >>> Clarify the documentation, but avoid overly exact promises. > >>> > >>> Clamping to bw_min and bw_max still feels generic: leave it in the core > >>> code, for now. > >> > >> Sounds like MPAM may be ready to start the schema parsing discussion again? > >> I understand that MPAM has a few more ways to describe memory bandwidth as > >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > >> schema format to user space, which seems like a good idea for new schema. > > > > My own ideas in this area are a little different, though I agree with > > the general idea. > > Should we expect a separate proposal from James? At some point, yes. We still need to have a chat about it. Right now, I was just throwing an idea out there. > > I'll respond separately on that, to avoid this thread getting off-topic. > > Much appreciated. > > > > > For this patch, my aim was to avoid changing anything unnecessarily. > > Understood. More below as I try to understand the details but it does not > really sound as though the current interface works that great for MPAM. If I > understand correctly this patch enables MPAM to use existing interface for > its memory bandwidth allocations but doing so does not enable users to > obtain benefit of hardware capabilities. 
For that users would want to use > the new interface? In ideal world, probably, yes. Since not all use cases will care about full precision, the MB resource (approximated for MPAM) should be fine for a lot of people, but I expect that sooner or later somebody will want more exact control. > >>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ > >>> the other tests except for the NONCONT_CAT tests, which do not seem to > >>> be supported in my configuration -- and have nothing to do with the > >>> code touched by this patch). > >> > >> Is the NONCONT_CAT test failing (i.e printing "not ok")? > >> > >> The NONCONT_CAT tests may print error messages as debug information as part of > >> running, but these errors are expected as part of the test. The test should accurately > >> state whether it passed or failed though. For example, below attempts to write > >> a non-contiguous CBM to a system that does not support non-contiguous masks. > >> This fails as expected, error messages printed as debugging and thus the test passes > >> with an "ok". > >> > >> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument > >> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected > >> ok 5 L3_NONCONT_CAT: test > > > > I don't think that this was anything to do with my changes, but I don't > > still seem to have the test output. (Since this test has to do with > > bitmap schemata (?), it seemed unlikely to be affected by changes to > > bw_validate().) > > I agree that this should not have anything to do with this patch. My concern > is that I understood that the test failed for a feature that is not supported. > If this is the case then there may be a problem with the test. The test should > not fail if the feature is not supported but instead skip the test. I'll try to capture more output from this when I re-run it, so that we can figure out what this is. 
> >>> Notes: > >>> > >>> I put the x86 version out of line in order to avoid having to move > >>> struct rdt_resource and its dependencies into resctrl_types.h -- which > >>> would create a lot of diff noise. Schemata writes from userspace have > >>> a high overhead in any case. > >> > >> Sounds good, I expect compiler will inline. > > > > The function and caller are in separate translation units, so unless > > LTO is used, I don't think the function will be inlined. > > Thanks, yes, indeed. > > > > >>> > >>> For MPAM the conversion will be a no-op, because the incoming > >>> percentage from the core resctrl code needs to be converted to hardware > >>> representation in the driver anyway. > >> > >> (addressed below) > >> > >>> > >>> Perhaps _all_ the types should move to resctrl_types.h. > >> > >> Can surely consider when there is a good motivation. > >> > >>> > >>> For now, I went for the smallest diffstat... > > > > I'll assume the motivation is not strong enough for now, but shout if > > you disagree. > > I agree. OK, I'll leave that as-is for now, then. > > > > [...] > > > >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > >>> index c7949dd44f2f..a1d0469d6dfb 100644 > >>> --- a/Documentation/filesystems/resctrl.rst > >>> +++ b/Documentation/filesystems/resctrl.rst > >>> @@ -143,12 +143,11 @@ with respect to allocation: > >>> user can request. > >>> > >>> "bandwidth_gran": > >>> - The granularity in which the memory bandwidth > >>> + The approximate granularity in which the memory bandwidth > >>> percentage is allocated. The allocated > >>> b/w percentage is rounded off to the next > >>> - control step available on the hardware. The > >>> - available bandwidth control steps are: > >>> - min_bandwidth + N * bandwidth_gran. > >>> + control step available on the hardware. The available > >>> + steps are at least as small as this value. > >> > >> A bit difficult to parse for me. 
> >> Is "at least as small as" same as "at least"? > > > > It was supposed to mean: "The available steps are no larger than this > > value." > > This is clear to me, especially when compared with the planned addition to > "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting > the paragraph below (more below). > > > > > Formally My expectation is that this value is the smallest integer > > number of percent which is not smaller than the apparent size of any > > individual rounding step. Equivalently, this is the smallest number g > > Considering the two statements: > - "The available steps are no larger than this value." > - "this value ... is not smaller than the apparent size of any individual rounding step" > > The "not larger" and "not smaller" sounds like all these words just end up saying that > this is the step size? They are intended to be the same statement: A <= B versus B >= A respectively. But I'd be the first to admit that the wording is a bit twisted! (I wouldn't be astonshed if I got something wrong somewhere.) See below for an alternative way of describing this that might be more intuitive. > > > for which writing "MB: 0=x" and "MB: 0=y" yield different > > configurations for every in-range x and where y = x + g and y is also > > in-range. > > > > That's a bit of a mouthful, though. If you can think of a more > > succinct way of putting it, I'm open to suggestions! > > > >> Please note that the documentation has a section "Memory bandwidth Allocation > >> and monitoring" that also contains these exact promises. > > > > Hmmm, somehow I completely missed that. > > > > Does the following make sense? Ideally, there would be a simpler way > > to describe the discrepancy between the reported and actual values of > > bw_gran... > > > > | Memory bandwidth Allocation and monitoring > > | ========================================== > > | > > | [...] 
> > | > > | The minimum bandwidth percentage value for each cpu model is predefined > > | and can be looked up through "info/MB/min_bandwidth". The bandwidth > > | granularity that is allocated is also dependent on the cpu model and can > > | be looked up at "info/MB/bandwidth_gran". The available bandwidth > > | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded > > | -to the next control step available on the hardware. > > | +control steps are: min_bw + N * (bw_gran - e), where e is a > > | +non-negative, hardware-defined real constant that is less than 1. > > | +Intermediate values are rounded to the next control step available on > > | +the hardware. > > | + > > | +At the time of writing, the constant e referred to in the preceding > > | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran > > | +describes the step size exactly), but this may not be the case on other > > | +hardware when the actual granularity is not an exact divisor of 100. > > Have you considered how to share the value of "e" with users? Perhaps introducing this "e" as an explicit parameter is a bad idea and overly formal. In practice, there are likely to be various sources of skid and approximation in the hardware, so exposing an actual value may be counterproductive -- i.e., what usable guarantee is this providing to userspace, if this is likely to be swamped by approximations elsewhere? Instead, maybe we can just say something like: | The available steps are spaced at roughly equal intervals between the | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading | info/MB/bandwidth_gran gives the worst-case precision of these | interval steps, in per cent. What do you think? If that's adequate, then the wording under the definition of "bandwidth_gran" could be aligned with this. 
> >>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c > >>> index d98e0d2de09f..c5e73b75aaa0 100644 > >>> --- a/fs/resctrl/ctrlmondata.c > >>> +++ b/fs/resctrl/ctrlmondata.c > >>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) > >>> return false; > >>> } > >>> > >>> - *data = roundup(bw, (unsigned long)r->membw.bw_gran); > >>> + *data = resctrl_arch_round_bw(bw, r); > >> > >> Please check that function comments remain accurate after changes (specifically > >> if making the conversion more generic as proposed below). > > > > I hoped that the comment for this function was still applicable, though > > it can probably be improved. How about the following? > > > > | - * hardware. The allocated bandwidth percentage is rounded to the next > > | - * control step available on the hardware. > > | + * hardware. The allocated bandwidth percentage is converted as > > | + * appropriate for consumption by the specific hardware driver. > > > > [...] > > Looks good to me. OK. > > > >>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h > >>> index 6fb4894b8cfd..5b2a555cf2dd 100644 > >>> --- a/include/linux/resctrl.h > >>> +++ b/include/linux/resctrl.h > >>> @@ -416,6 +416,12 @@ static inline u32 resctrl_get_config_index(u32 closid, > >>> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); > >>> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); > >>> > >>> +/* > >>> + * Round a bandwidth control value to the nearest value acceptable to > >>> + * the arch code for resource r: > >>> + */ > >>> +u32 resctrl_arch_round_bw(u32 val, const struct rdt_resource *r); > >>> + > >> > >> I do not think that resctrl should make any assumptions on what the > >> architecture's conversion does (i.e "round"). That architecture needs to be > >> asked to "round a bandwidth control value" also sounds strange since resctrl really > >> should be able to do something like rounding itself. 
As I understand from > >> the notes this will be a no-op for MPAM making this even more confusing. > >> > >> How about naming the helper something like resctrl_arch_convert_bw()? > >> (Open to other ideas of course). > >> > >> If you make such a change, please check that subject of patch still fits. > > > > I struggled a bit with the name. Really, this is converting the value > > to an intermediate form (which might or might not involve rounding). > > For historical reasons, this is a value suitable for writing directly > > to the relevant x86 MSR without any further interpretation. > > > > For MPAM, it is convenient to do this conversion later, rather than > > during parsing of the value. > > > > > > Would a name like resctrl_arch_preconvert_bw() be acceptable? > > Yes. > > > > > This isn't more informative than your suggestion regarding what the > > conversion is expected to do, but may convey the expectation that the > > output value may still not be in its final (i.e., hardware) form. > > Sounds good, yes. OK, I'll hack that in. > > > > >> I think that using const to pass data to architecture is great, thanks. > >> > >> Reinette > > > > I try to constify by default when straightforward to do so, since the > > compiler can then find which cases need to change; the reverse > > direction is harder to automate... > > Could you please elaborate what you mean with "reverse direction"? I just meant that over-consting tends to result in violations of the language that the compiler will detect, but under-consting doesn't:

	static void foo(int *nonconstp, const int *constp)
	{
		*constp = 0;	// compiler error
		(*nonconstp);	// silently accepted, though it could have been const
	}

So, the compiler will tell you places where const needs to be removed (or something else needs to change), but to find places where const could be _added_, you have to hunt them down yourself, or use some other tool that is probably not part of the usual workflow. 
Cheers ---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-25 12:46 ` Dave Martin @ 2025-09-25 20:53 ` Reinette Chatre 2025-09-25 21:35 ` Luck, Tony 2025-09-29 12:43 ` Dave Martin 0 siblings, 2 replies; 52+ messages in thread From: Reinette Chatre @ 2025-09-25 20:53 UTC (permalink / raw) To: Dave Martin Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 9/25/25 5:46 AM, Dave Martin wrote: > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: >> On 9/22/25 7:39 AM, Dave Martin wrote: >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >>>> Hi Dave, >>>> >>>> nits: >>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other >>>> resctrl code (and was established by Arm :)). >>>> Also please use upper case for acronym mba->MBA. >>> >>> Ack (the local custom in the MPAM code is to use "mba", but arguably, >>> the meaning is not quite the same -- I'll change it.) >> >> I am curious what the motivation is for the custom? Knowing this will help >> me to keep things consistent when the two worlds meet. > > I think this has just evolved over time. On the x86 side, MBA is a > specific architectural feature, but on the MPAM side the architecture > doesn't really have a name for the same thing. Memory bandwidth is a > concept, but a few different types of control are defined for it, with > different names. > > So, for the MPAM driver "mba" is more of a software concept than > something in a published spec: it's the glue that attaches to "MB" > resource as seen through resctrl. > > (This isn't official though; it's just the mental model that I have > formed.) I see. Thank you for the details. My mental model is simpler: write acronyms in upper case. ... 
>>>>> On MPAM systems, this results in somewhat worse-than-worst-case >>>>> rounding, since bw_gran is in general only an approximation to the >>>>> actual hardware granularity on these systems, and the hardware >>>>> bandwidth allocation control value is not natively a percentage -- >>>>> necessitating a further conversion in the resctrl_arch_update_domains() >>>>> path, regardless of the conversion done at parse time. >>>>> >>>>> Allow the arch to provide its own parse-time conversion that is >>>>> appropriate for the hardware, and move the existing conversion to x86. >>>>> This will avoid accumulated error from rounding the value twice on MPAM >>>>> systems. >>>>> >>>>> Clarify the documentation, but avoid overly exact promises. >>>>> >>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core >>>>> code, for now. >>>> >>>> Sounds like MPAM may be ready to start the schema parsing discussion again? >>>> I understand that MPAM has a few more ways to describe memory bandwidth as >>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing >>>> schema format to user space, which seems like a good idea for new schema. >>> >>> My own ideas in this area are a little different, though I agree with >>> the general idea. >> >> Should we expect a separate proposal from James? > > At some point, yes. We still need to have a chat about it. > > Right now, I was just throwing an idea out there. Thank you very much for doing so. We are digesting it. ... >>> For this patch, my aim was to avoid changing anything unnecessarily. >> >> Understood. More below as I try to understand the details but it does not >> really sound as though the current interface works that great for MPAM. If I >> understand correctly this patch enables MPAM to use existing interface for >> its memory bandwidth allocations but doing so does not enable users to >> obtain benefit of hardware capabilities. For that users would want to use >> the new interface? 
> > In ideal world, probably, yes. > > Since not all use cases will care about full precision, the MB resource > (approximated for MPAM) should be fine for a lot of people, but I > expect that sooner or later somebody will want more exact control. ack. > >>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ >>>>> the other tests except for the NONCONT_CAT tests, which do not seem to >>>>> be supported in my configuration -- and have nothing to do with the >>>>> code touched by this patch). >>>> >>>> Is the NONCONT_CAT test failing (i.e printing "not ok")? >>>> >>>> The NONCONT_CAT tests may print error messages as debug information as part of >>>> running, but these errors are expected as part of the test. The test should accurately >>>> state whether it passed or failed though. For example, below attempts to write >>>> a non-contiguous CBM to a system that does not support non-contiguous masks. >>>> This fails as expected, error messages printed as debugging and thus the test passes >>>> with an "ok". >>>> >>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument >>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected >>>> ok 5 L3_NONCONT_CAT: test >>> >>> I don't think that this was anything to do with my changes, but I don't >>> still seem to have the test output. (Since this test has to do with >>> bitmap schemata (?), it seemed unlikely to be affected by changes to >>> bw_validate().) >> >> I agree that this should not have anything to do with this patch. My concern >> is that I understood that the test failed for a feature that is not supported. >> If this is the case then there may be a problem with the test. The test should >> not fail if the feature is not supported but instead skip the test. > > I'll try to capture more output from this when I re-run it, so that we > can figure out what this is. Thank you. ... >>> >>> [...] 
>>> >>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst >>>>> index c7949dd44f2f..a1d0469d6dfb 100644 >>>>> --- a/Documentation/filesystems/resctrl.rst >>>>> +++ b/Documentation/filesystems/resctrl.rst >>>>> @@ -143,12 +143,11 @@ with respect to allocation: >>>>> user can request. >>>>> >>>>> "bandwidth_gran": >>>>> - The granularity in which the memory bandwidth >>>>> + The approximate granularity in which the memory bandwidth >>>>> percentage is allocated. The allocated >>>>> b/w percentage is rounded off to the next >>>>> - control step available on the hardware. The >>>>> - available bandwidth control steps are: >>>>> - min_bandwidth + N * bandwidth_gran. >>>>> + control step available on the hardware. The available >>>>> + steps are at least as small as this value. >>>> >>>> A bit difficult to parse for me. >>>> Is "at least as small as" same as "at least"? >>> >>> It was supposed to mean: "The available steps are no larger than this >>> value." >> >> This is clear to me, especially when compared with the planned addition to >> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting >> the paragraph below (more below). >> >>> >>> Formally My expectation is that this value is the smallest integer >>> number of percent which is not smaller than the apparent size of any >>> individual rounding step. Equivalently, this is the smallest number g >> >> Considering the two statements: >> - "The available steps are no larger than this value." >> - "this value ... is not smaller than the apparent size of any individual rounding step" >> >> The "not larger" and "not smaller" sounds like all these words just end up saying that >> this is the step size? > > They are intended to be the same statement: A <= B versus > B >= A respectively. This is what I understood from the words ... and that made me think that it can be simplified to A = B ... but no need to digress ... onto the alternatives below ... 
> > But I'd be the first to admit that the wording is a bit twisted! > (I wouldn't be astonished if I got something wrong somewhere.) > > See below for an alternative way of describing this that might be more > intuitive. > >> >>> for which writing "MB: 0=x" and "MB: 0=y" yield different >>> configurations for every in-range x and where y = x + g and y is also >>> in-range. >>> >>> That's a bit of a mouthful, though. If you can think of a more >>> succinct way of putting it, I'm open to suggestions! >>> >>>> Please note that the documentation has a section "Memory bandwidth Allocation >>>> and monitoring" that also contains these exact promises. >>> >>> Hmmm, somehow I completely missed that. >>> >>> Does the following make sense? Ideally, there would be a simpler way >>> to describe the discrepancy between the reported and actual values of >>> bw_gran... >>> >>> | Memory bandwidth Allocation and monitoring >>> | ========================================== >>> | >>> | [...] >>> | >>> | The minimum bandwidth percentage value for each cpu model is predefined >>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth >>> | granularity that is allocated is also dependent on the cpu model and can >>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth >>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded >>> | -to the next control step available on the hardware. >>> | +control steps are: min_bw + N * (bw_gran - e), where e is a >>> | +non-negative, hardware-defined real constant that is less than 1. >>> | +Intermediate values are rounded to the next control step available on >>> | +the hardware. >>> | + >>> | +At the time of writing, the constant e referred to in the preceding >>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran >>> | +describes the step size exactly), but this may not be the case on other >>> | +hardware when the actual granularity is not an exact divisor of 100.
>> >> Have you considered how to share the value of "e" with users? > > Perhaps introducing this "e" as an explicit parameter is a bad idea and > overly formal. In practice, there are likely to be various sources of > skid and approximation in the hardware, so exposing an actual value may > be counterproductive -- i.e., what usable guarantee is this providing > to userspace, if this is likely to be swamped by approximations > elsewhere? > > Instead, maybe we can just say something like: > > | The available steps are spaced at roughly equal intervals between the > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading > | info/MB/bandwidth_gran gives the worst-case precision of these > | interval steps, in per cent. > > What do you think? I find "worst-case precision" a bit confusing; consider, for example, what would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives the upper limit of these interval steps"? I believe this matches what you mentioned a couple of messages ago: "The available steps are no larger than this value." (and "per cent" -> "percent") > > If that's adequate, then the wording under the definition of > "bandwidth_gran" could be aligned with this. I think putting together a couple of your proposals and statements while making the text more accurate may work: "bandwidth_gran": The approximate granularity in which the memory bandwidth percentage is allocated. The allocated bandwidth percentage is rounded up to the next control step available on the hardware. The available hardware steps are no larger than this value. I assume "available" is needed because, even though the steps are not larger than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" to 100% range? >>>> I think that using const to pass data to architecture is great, thanks.
>>>> >>>> Reinette >>> >>> I try to constify by default when straightforward to do so, since the >>> compiler can then find which cases need to change; the reverse >>> direction is harder to automate... >> >> Could you please elaborate what you mean with "reverse direction"? > > I just meant that over-consting tends to result in violations of the > language that the compiler will detect, but under-consting doesn't: > > static void foo(int *nonconstp, const int *constp) > { > *constp = 0; // compiler error > (*nonconstp); // silently accepted, though it could have been const > } > > So, the compiler will tell you places where const needs to be removed > (or something else needs to change), but to find places where const > could be _added_, you have to hunt them down yourself, or use some > other tool that is probably not part of the usual workflow. Got it, thanks. Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-25 20:53 ` Reinette Chatre @ 2025-09-25 21:35 ` Luck, Tony 2025-09-25 22:18 ` Reinette Chatre 2025-09-29 12:43 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Luck, Tony @ 2025-09-25 21:35 UTC (permalink / raw) To: Reinette Chatre Cc: Dave Martin, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/25/25 5:46 AM, Dave Martin wrote: > > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: > >> On 9/22/25 7:39 AM, Dave Martin wrote: > >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > >>>> Hi Dave, > >>>> > >>>> nits: > >>>> Please use the subject prefix "x86,fs/resctrl" to be consistent with other > >>>> resctrl code (and was established by Arm :)). > >>>> Also please use upper case for acronym mba->MBA. > >>> > >>> Ack (the local custom in the MPAM code is to use "mba", but arguably, > >>> the meaning is not quite the same -- I'll change it.) > >> > >> I am curious what the motivation is for the custom? Knowing this will help > >> me to keep things consistent when the two worlds meet. > > > > I think this has just evolved over time. On the x86 side, MBA is a > > specific architectural feature, but on the MPAM side the architecture > > doesn't really have a name for the same thing. Memory bandwidth is a > > concept, but a few different types of control are defined for it, with > > different names. > > > > So, for the MPAM driver "mba" is more of a software concept than > > something in a published spec: it's the glue that attaches to "MB" > > resource as seen through resctrl. > > > > (This isn't official though; it's just the mental model that I have > > formed.) > > I see. Thank you for the details. My mental model is simpler: write acronyms > in upper case. > > ... 
> > >>>>> On MPAM systems, this results in somewhat worse-than-worst-case > >>>>> rounding, since bw_gran is in general only an approximation to the > >>>>> actual hardware granularity on these systems, and the hardware > >>>>> bandwidth allocation control value is not natively a percentage -- > >>>>> necessitating a further conversion in the resctrl_arch_update_domains() > >>>>> path, regardless of the conversion done at parse time. > >>>>> > >>>>> Allow the arch to provide its own parse-time conversion that is > >>>>> appropriate for the hardware, and move the existing conversion to x86. > >>>>> This will avoid accumulated error from rounding the value twice on MPAM > >>>>> systems. > >>>>> > >>>>> Clarify the documentation, but avoid overly exact promises. > >>>>> > >>>>> Clamping to bw_min and bw_max still feels generic: leave it in the core > >>>>> code, for now. > >>>> > >>>> Sounds like MPAM may be ready to start the schema parsing discussion again? > >>>> I understand that MPAM has a few more ways to describe memory bandwidth as > >>>> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > >>>> schema format to user space, which seems like a good idea for new schema. > >>> > >>> My own ideas in this area are a little different, though I agree with > >>> the general idea. > >> > >> Should we expect a separate proposal from James? > > > > At some point, yes. We still need to have a chat about it. > > > > Right now, I was just throwing an idea out there. > > Thank you very much for doing so. We are digesting it. > > > ... > > >>> For this patch, my aim was to avoid changing anything unnecessarily. > >> > >> Understood. More below as I try to understand the details but it does not > >> really sound as though the current interface works that great for MPAM. 
If I > >> understand correctly this patch enables MPAM to use existing interface for > >> its memory bandwidth allocations but doing so does not enable users to > >> obtain benefit of hardware capabilities. For that users would want to use > >> the new interface? > > > > In ideal world, probably, yes. > > > > Since not all use cases will care about full precision, the MB resource > > (approximated for MPAM) should be fine for a lot of people, but I > > expect that sooner or later somebody will want more exact control. > > ack. > > > > >>>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ > >>>>> the other tests except for the NONCONT_CAT tests, which do not seem to > >>>>> be supported in my configuration -- and have nothing to do with the > >>>>> code touched by this patch). > >>>> > >>>> Is the NONCONT_CAT test failing (i.e printing "not ok")? > >>>> > >>>> The NONCONT_CAT tests may print error messages as debug information as part of > >>>> running, but these errors are expected as part of the test. The test should accurately > >>>> state whether it passed or failed though. For example, below attempts to write > >>>> a non-contiguous CBM to a system that does not support non-contiguous masks. > >>>> This fails as expected, error messages printed as debugging and thus the test passes > >>>> with an "ok". > >>>> > >>>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument > >>>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected > >>>> ok 5 L3_NONCONT_CAT: test > >>> > >>> I don't think that this was anything to do with my changes, but I don't > >>> still seem to have the test output. (Since this test has to do with > >>> bitmap schemata (?), it seemed unlikely to be affected by changes to > >>> bw_validate().) > >> > >> I agree that this should not have anything to do with this patch. My concern > >> is that I understood that the test failed for a feature that is not supported. 
> >> If this is the case then there may be a problem with the test. The test should > >> not fail if the feature is not supported but instead skip the test. > > > > I'll try to capture more output from this when I re-run it, so that we > > can figure out what this is. > > Thank you. > > > ... > > >>> > >>> [...] > >>> > >>>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst > >>>>> index c7949dd44f2f..a1d0469d6dfb 100644 > >>>>> --- a/Documentation/filesystems/resctrl.rst > >>>>> +++ b/Documentation/filesystems/resctrl.rst > >>>>> @@ -143,12 +143,11 @@ with respect to allocation: > >>>>> user can request. > >>>>> > >>>>> "bandwidth_gran": > >>>>> - The granularity in which the memory bandwidth > >>>>> + The approximate granularity in which the memory bandwidth > >>>>> percentage is allocated. The allocated > >>>>> b/w percentage is rounded off to the next > >>>>> - control step available on the hardware. The > >>>>> - available bandwidth control steps are: > >>>>> - min_bandwidth + N * bandwidth_gran. > >>>>> + control step available on the hardware. The available > >>>>> + steps are at least as small as this value. > >>>> > >>>> A bit difficult to parse for me. > >>>> Is "at least as small as" same as "at least"? > >>> > >>> It was supposed to mean: "The available steps are no larger than this > >>> value." > >> > >> This is clear to me, especially when compared with the planned addition to > >> "Memory bandwidth Allocation and monitoring" ... but I do find it contradicting > >> the paragraph below (more below). > >> > >>> > >>> Formally My expectation is that this value is the smallest integer > >>> number of percent which is not smaller than the apparent size of any > >>> individual rounding step. Equivalently, this is the smallest number g > >> > >> Considering the two statements: > >> - "The available steps are no larger than this value." > >> - "this value ... 
is not smaller than the apparent size of any individual rounding step" > >> > >> The "not larger" and "not smaller" sounds like all these words just end up saying that > >> this is the step size? > > > > They are intended to be the same statement: A <= B versus > > B >= A respectively. > > This is what I understood from the words ... and that made me think that it > can be simplified to A = B ... but no need to digress ... onto the alternatives below ... > > > > > But I'd be the first to admit that the wording is a bit twisted! > > (I wouldn't be astonshed if I got something wrong somewhere.) > > > > See below for an alternative way of describing this that might be more > > intuitive. > > > >> > >>> for which writing "MB: 0=x" and "MB: 0=y" yield different > >>> configurations for every in-range x and where y = x + g and y is also > >>> in-range. > >>> > >>> That's a bit of a mouthful, though. If you can think of a more > >>> succinct way of putting it, I'm open to suggestions! > >>> > >>>> Please note that the documentation has a section "Memory bandwidth Allocation > >>>> and monitoring" that also contains these exact promises. > >>> > >>> Hmmm, somehow I completely missed that. > >>> > >>> Does the following make sense? Ideally, there would be a simpler way > >>> to describe the discrepancy between the reported and actual values of > >>> bw_gran... > >>> > >>> | Memory bandwidth Allocation and monitoring > >>> | ========================================== > >>> | > >>> | [...] > >>> | > >>> | The minimum bandwidth percentage value for each cpu model is predefined > >>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth > >>> | granularity that is allocated is also dependent on the cpu model and can > >>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth > >>> | -control steps are: min_bw + N * bw_gran. Intermediate values are rounded > >>> | -to the next control step available on the hardware. 
> >>> | +control steps are: min_bw + N * (bw_gran - e), where e is a > >>> | +non-negative, hardware-defined real constant that is less than 1. > >>> | +Intermediate values are rounded to the next control step available on > >>> | +the hardware. > >>> | + > >>> | +At the time of writing, the constant e referred to in the preceding > >>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran > >>> | +describes the step size exactly), but this may not be the case on other > >>> | +hardware when the actual granularity is not an exact divisor of 100. > >> > >> Have you considered how to share the value of "e" with users? > > > > Perhaps introducing this "e" as an explicit parameter is a bad idea and > > overly formal. In practice, there are likely to various sources of > > skid and approximation in the hardware, so exposing an actual value may > > be counterproductive -- i.e., what usable guarantee is this providing > > to userspace, if this is likely to be swamped by approximations > > elsewhere? > > > > Instead, maybe we can just say something like: > > > > | The available steps are spaced at roughly equal intervals between the > > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading > > | info/MB/bandwidth_gran gives the worst-case precision of these > > | interval steps, in per cent. > > > > What do you think? > > I find "worst-case precision" a bit confusing, consider for example, what > would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives > the upper limit of these interval steps"? I believe this matches what you > mentioned a couple of messages ago: "The available steps are no larger than this > value." > > (and "per cent" -> "percent") > > > > > If that's adequate, then the wording under the definition of > > "bandwidth_gran" could be aligned with this. 
> > I think putting together a couple of your proposals and statements while making the > text more accurate may work: > > "bandwidth_gran": > The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated bandwidth percentage > is rounded up to the next control step available on the > hardware. The available hardware steps are no larger than > this value. > > I assume "available" is needed because, even though the steps are not larger > than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > to 100% range? What values are allowed for "bandwidth_gran"? The "Intel® Resource Director Technology (Intel® RDT) Architecture Specification" https://cdrdv2.intel.com/v1/dl/getContent/789566 describes the upcoming region-aware memory bandwidth allocation controls as being a number from "1" to "Q" (enumerated in an ACPI table). First implementation looks like Q == 255 which means a granularity of 0.392%. The spec has headroom to allow Q == 511. I don't expect users to need that granularity at the high bandwidth end of the range, but I do expect them to care for highly throttled background/batch jobs to make sure they can't affect performance of the high priority jobs. I'd hate to have to round all low bandwidth controls to 1% steps. > > >>>> I think that using const to pass data to architecture is great, thanks. > >>>> > >>>> Reinette > >>> > >>> I try to constify by default when straightforward to do so, since the > >>> compiler can then find which cases need to change; the reverse > >>> direction is harder to automate... > >> > >> Could you please elaborate what you mean with "reverse direction"?
> > > > I just meant that over-consting tends to result in violations of the > > language that the compiler will detect, but under-consting doesn't: > > > > static void foo(int *nonconstp, const int *constp) > > { > > *constp = 0; // compiler error > > (*nonconstp); // silently accepted, though it could have been const > > } > > > > So, the compiler will tell you places where const needs to be removed > > (or something else needs to change), but to find places where const > > could be _added_, you have to hunt them down yourself, or use some > > other tool that is probably not part of the usual workflow. > > Got it, thanks. > > Reinette > -Tony ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-25 21:35 ` Luck, Tony @ 2025-09-25 22:18 ` Reinette Chatre 2025-09-29 13:08 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-09-25 22:18 UTC (permalink / raw) To: Luck, Tony Cc: Dave Martin, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Tony, On 9/25/25 2:35 PM, Luck, Tony wrote: > On Thu, Sep 25, 2025 at 01:53:37PM -0700, Reinette Chatre wrote: >> On 9/25/25 5:46 AM, Dave Martin wrote: >>> On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: >>>> On 9/22/25 7:39 AM, Dave Martin wrote: >>>>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: ... >>>>> for which writing "MB: 0=x" and "MB: 0=y" yield different >>>>> configurations for every in-range x and where y = x + g and y is also >>>>> in-range. >>>>> >>>>> That's a bit of a mouthful, though. If you can think of a more >>>>> succinct way of putting it, I'm open to suggestions! >>>>> >>>>>> Please note that the documentation has a section "Memory bandwidth Allocation >>>>>> and monitoring" that also contains these exact promises. >>>>> >>>>> Hmmm, somehow I completely missed that. >>>>> >>>>> Does the following make sense? Ideally, there would be a simpler way >>>>> to describe the discrepancy between the reported and actual values of >>>>> bw_gran... >>>>> >>>>> | Memory bandwidth Allocation and monitoring >>>>> | ========================================== >>>>> | >>>>> | [...] >>>>> | >>>>> | The minimum bandwidth percentage value for each cpu model is predefined >>>>> | and can be looked up through "info/MB/min_bandwidth". The bandwidth >>>>> | granularity that is allocated is also dependent on the cpu model and can >>>>> | be looked up at "info/MB/bandwidth_gran". The available bandwidth >>>>> | -control steps are: min_bw + N * bw_gran. 
Intermediate values are rounded >>>>> | -to the next control step available on the hardware. >>>>> | +control steps are: min_bw + N * (bw_gran - e), where e is a >>>>> | +non-negative, hardware-defined real constant that is less than 1. >>>>> | +Intermediate values are rounded to the next control step available on >>>>> | +the hardware. >>>>> | + >>>>> | +At the time of writing, the constant e referred to in the preceding >>>>> | +paragraph is always zero on Intel and AMD platforms (i.e., bw_gran >>>>> | +describes the step size exactly), but this may not be the case on other >>>>> | +hardware when the actual granularity is not an exact divisor of 100. >>>> >>>> Have you considered how to share the value of "e" with users? >>> >>> Perhaps introducing this "e" as an explicit parameter is a bad idea and >>> overly formal. In practice, there are likely to various sources of >>> skid and approximation in the hardware, so exposing an actual value may >>> be counterproductive -- i.e., what usable guarantee is this providing >>> to userspace, if this is likely to be swamped by approximations >>> elsewhere? >>> >>> Instead, maybe we can just say something like: >>> >>> | The available steps are spaced at roughly equal intervals between the >>> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading >>> | info/MB/bandwidth_gran gives the worst-case precision of these >>> | interval steps, in per cent. >>> >>> What do you think? >> >> I find "worst-case precision" a bit confusing, consider for example, what >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives >> the upper limit of these interval steps"? I believe this matches what you >> mentioned a couple of messages ago: "The available steps are no larger than this >> value." >> >> (and "per cent" -> "percent") >> >>> >>> If that's adequate, then the wording under the definition of >>> "bandwidth_gran" could be aligned with this. 
>> >> I think putting together a couple of your proposals and statements while making the >> text more accurate may work: >> >> "bandwidth_gran": >> The approximate granularity in which the memory bandwidth >> percentage is allocated. The allocated bandwidth percentage >> is rounded up to the next control step available on the >> hardware. The available hardware steps are no larger than >> this value. >> >> I assume "available" is needed because, even though the steps are not larger >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" >> to 100% range? > > What values are allowed for "bandwidth_gran"? The "Intel® Resource This is a property of the MB resource where the ABI is to express allocations as a percentage. Current doc: "bandwidth_gran": The granularity in which the memory bandwidth percentage is allocated. The allocated b/w percentage is rounded off to the next control step available on the hardware. The available bandwidth control steps are: min_bandwidth + N * bandwidth_gran. I do not expect we can switch it to fractions so I would say that integer values are allowed, starting at 1. I understand that the MB resource on AMD supports different ranges and I find that ABI discrepancy unfortunate. I do not think this should be seen as an opportunity that "anything goes" when it comes to MB and used as an excuse to pile on another range of hardware dependent inputs. Instead I believe we should keep MB interface as-is and instead work on a generic interface that enables user space to interact with resctrl to have benefit of all hardware capabilities without needing to know which hardware is underneath. > Director Technology (Intel® RDT) Architecture Specification" > > https://cdrdv2.intel.com/v1/dl/getContent/789566 > > describes the upcoming region aware memory bandwidth allocation > controls as being a number from "1" to "Q" (enumerated in an ACPI > table). 
First implementation looks like Q == 255 which means a > granularity of 0.392% The spec has headroom to allow Q == 511. > > I don't expect users to need that granularity at the high bandwidth > end of the range, but I do expect them to care for highly throttled > background/batch jobs to make sure they can't affect performance of > the high priority jobs. > > I'd hate to have to round all low bandwidth controls to 1% steps. This is the limitation if choosing to expose this feature as an MB resource and seems to be the same problem that Dave is facing. For finer granularity allocations I expect that we would need a new schema/resource backed by new properties as proposed by Dave in https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ This will require updates to user space (that will anyway be needed if wedging another non-ABI input into MB). Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-25 22:18 ` Reinette Chatre @ 2025-09-29 13:08 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-09-29 13:08 UTC (permalink / raw) To: Reinette Chatre, Luck, Tony Cc: linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, Tony, On Thu, Sep 25, 2025 at 03:18:51PM -0700, Reinette Chatre wrote: > Hi Tony, > > On 9/25/25 2:35 PM, Luck, Tony wrote: [...] > > Director Technology (Intel® RDT) Architecture Specification" > > > > https://cdrdv2.intel.com/v1/dl/getContent/789566 > > > > describes the upcoming region aware memory bandwidth allocation > > controls as being a number from "1" to "Q" (enumerated in an ACPI > > table). First implementation looks like Q == 255 which means a > > granularity of 0.392%. The spec has headroom to allow Q == 511. That does look like it would benefit from exposing the hardware field without rounding (similarly as for MPAM). Is the relationship between this value and the expected memory system throughput actually defined anywhere? If the expected throughput is exactly proportional to this value, or a reasonable approximation to this, then that is simple -- but I can't see it actually stated. When a spec suggests a need to divide by (2^N - 1), I do wonder whether that is what they _really_ meant (and whether hardware will just do the obvious cheap approximation in defiance of the spec). > > > > I don't expect users to need that granularity at the high bandwidth > > end of the range, but I do expect them to care for highly throttled > > background/batch jobs to make sure they can't affect performance of > > the high priority jobs. A case where it _might_ matter is where there is a non-trivial number of jobs, and an attempt is made to share bandwidth among them.
Although it may not matter exactly how much bandwidth is given to each job, the rounding errors may accumulate so that they add up to significantly more than or less than 100% in total. This feels undesirable. Rounding off the value in the interface effectively makes it impossible for portable software to avoid this problem... > > I'd hate to have to round all low bandwidth controls to 1% steps. +1! (No pun intended.) > This is the limitation if choosing to expose this feature as an MB resource > and seems to be the same problem that Dave is facing. For finer granularity > allocations I expect that we would need a new schema/resource backed by new > properties as proposed by Dave in > https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ > This will require updates to user space (that will anyway be needed if wedging > another non-ABI input into MB). > > Reinette Ack; while we could add decimal places to bandwidth_gran as reported to userspace, we don't know that software isn't going to choke on that. Plus, we could need to add precision to the control values too -- it's no good advertising 0.5% granularity when the MB schema only accepts/reports integers. Software that parses anything as (potentially) a real number might work transparently, but we didn't warn users that they might need to do that... Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-25 20:53 ` Reinette Chatre 2025-09-25 21:35 ` Luck, Tony @ 2025-09-29 12:43 ` Dave Martin 2025-09-29 15:38 ` Reinette Chatre 1 sibling, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-09-29 12:43 UTC (permalink / raw) To: Reinette Chatre Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote: > Hi Dave, > > On 9/25/25 5:46 AM, Dave Martin wrote: > > On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: > >> On 9/22/25 7:39 AM, Dave Martin wrote: > >>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > >>>> Hi Dave, [...] > >>>> Also please use upper case for acronym mba->MBA. > >>> > >>> Ack (the local custom in the MPAM code is to use "mba", but arguably, > >>> the meaning is not quite the same -- I'll change it.) > >> > >> I am curious what the motivation is for the custom? Knowing this will help > >> me to keep things consistent when the two worlds meet. > > > > I think this has just evolved over time. On the x86 side, MBA is a > > specific architectural feature, but on the MPAM side the architecture > > doesn't really have a name for the same thing. Memory bandwidth is a > > concept, but a few different types of control are defined for it, with > > different names. > > > > So, for the MPAM driver "mba" is more of a software concept than > > something in a published spec: it's the glue that attaches to "MB" > > resource as seen through resctrl. > > > > (This isn't official though; it's just the mental model that I have > > formed.) > > I see. Thank you for the details. My mental model is simpler: write acronyms > in upper case. Generally, I agree, although I'm not sure whether that acronym belongs in the MPAM-specific code. For this patch, though, that's irrelevant. 
I've changed it to "MBA" as requested. [...] > >> really sound as though the current interface works that great for MPAM. If I > >> understand correctly this patch enables MPAM to use existing interface for > >> its memory bandwidth allocations but doing so does not enable users to > >> obtain benefit of hardware capabilities. For that users would want to use > >> the new interface? > > > > In ideal world, probably, yes. > > > > Since not all use cases will care about full precision, the MB resource > > (approximated for MPAM) should be fine for a lot of people, but I > > expect that sooner or later somebody will want more exact control. > > ack. OK. [...] > >> Considering the two statements: > >> - "The available steps are no larger than this value." > >> - "this value ... is not smaller than the apparent size of any individual rounding step" > >> > >> The "not larger" and "not smaller" sounds like all these words just end up saying that > >> this is the step size? > > > > They are intended to be the same statement: A <= B versus > > B >= A respectively. > > This is what I understood from the words ... and that made me think that it > can be simplified to A = B ... but no need to digress ... onto the alternatives below ... Right... [...] > > Instead, maybe we can just say something like: > > > > | The available steps are spaced at roughly equal intervals between the > > | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading > > | info/MB/bandwidth_gran gives the worst-case precision of these > > | interval steps, in per cent. > > > > What do you think? > > I find "worst-case precision" a bit confusing, consider for example, what > would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives > the upper limit of these interval steps"? I believe this matches what you > mentioned a couple of messages ago: "The available steps are no larger than this > value." Yes, that works.
"Worst case" implies a value judgement that smaller steps are "better" than large steps, since the goal is control. But your wording, to the effect that this is the largest (apparent) step size, conveys all the needed information. > (and "per cent" -> "percent") ( Note: https://en.wiktionary.org/wiki/per_cent ) (Though either is acceptable, the fused word has a more informal feel to it for me. Happy to change it -- though your rewording below gets rid of it anyway.) (This word doesn't appear in resctrl.rst -- everything is "percentage" etc.) > > > If that's adequate, then the wording under the definition of > > "bandwidth_gran" could be aligned with this. > > I think putting together a couple of your proposals and statements while making the > text more accurate may work: > > "bandwidth_gran": > The approximate granularity in which the memory bandwidth > percentage is allocated. The allocated bandwidth percentage > is rounded up to the next control step available on the > hardware. The available hardware steps are no larger than > this value. That's better, thanks. I'm happy to pick this up and reword the text in both places along these lines. > I assume "available" is needed because, even though the steps are not larger > than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > to 100% range? Yes -- or, rather, the steps _look_ inconsistent because they are rounded to exact percentages by the interface. I don't think we expect the actual steps in the hardware to be irregular. [...] > Reinette Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 12:43 ` Dave Martin @ 2025-09-29 15:38 ` Reinette Chatre 2025-09-29 16:10 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-09-29 15:38 UTC (permalink / raw) To: Dave Martin Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 9/29/25 5:43 AM, Dave Martin wrote: > Hi Reinette, > > On Thu, Sep 25, 2025 at 09:53:37PM +0100, Reinette Chatre wrote: >> Hi Dave, >> >> On 9/25/25 5:46 AM, Dave Martin wrote: >>> On Tue, Sep 23, 2025 at 10:27:40AM -0700, Reinette Chatre wrote: >>>> On 9/22/25 7:39 AM, Dave Martin wrote: >>>>> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >>>>>> Hi Dave, > > [...] > >>>>>> Also please use upper case for acronym mba->MBA. >>>>> >>>>> Ack (the local custom in the MPAM code is to use "mba", but arguably, >>>>> the meaning is not quite the same -- I'll change it.) >>>> >>>> I am curious what the motivation is for the custom? Knowing this will help >>>> me to keep things consistent when the two worlds meet. >>> >>> I think this has just evolved over time. On the x86 side, MBA is a >>> specific architectural feature, but on the MPAM side the architecture >>> doesn't really have a name for the same thing. Memory bandwidth is a >>> concept, but a few different types of control are defined for it, with >>> different names. >>> >>> So, for the MPAM driver "mba" is more of a software concept than >>> something in a published spec: it's the glue that attaches to "MB" >>> resource as seen through resctrl. >>> >>> (This isn't official though; it's just the mental model that I have >>> formed.) >> >> I see. Thank you for the details. My mental model is simpler: write acronyms >> in upper case. > > Generally, I agree, although I'm not sure whether that acronym belongs > in the MPAM-specific code. 
> > For this patch, though, that's irrelevant. I've changed it to "MBA" > as requested. > Thank you. ... >>>> Considering the two statements: >>>> - "The available steps are no larger than this value." >>>> - "this value ... is not smaller than the apparent size of any individual rounding step" >>>> >>>> The "not larger" and "not smaller" sounds like all these words just end up saying that >>>> this is the step size? >>> >>> They are intended to be the same statement: A <= B versus >>> B >= A respectively. >> >> This is what I understood from the words ... and that made me think that it >> can be simplified to A = B ... but no need to digress ... onto the alternatives below ... > > Right... > > [...] > >>> Instead, maybe we can just say something like: >>> >>> | The available steps are spaced at roughly equal intervals between the >>> | value reported by info/MB/min_bandwidth and 100%, inclusive. Reading >>> | info/MB/bandwidth_gran gives the worst-case precision of these >>> | interval steps, in per cent. >>> >>> What do you think? >> >> I find "worst-case precision" a bit confusing, consider for example, what >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives >> the upper limit of these interval steps"? I believe this matches what you >> mentioned a couple of messages ago: "The available steps are no larger than this >> value." > > Yes, that works. "Worst case" implies a value judgement that smaller > steps are "better" then large steps, since the goal is control. > > But your wording, to the effect that this is the largest (apparent) > step size, conveys all the needed information. Thank you for considering it. My preference is for stating things succinctly and not leaving too much to interpretation. > >> (and "per cent" -> "percent") > > ( Note: https://en.wiktionary.org/wiki/per_cent ) Yes, in particular I note the "chiefly Commonwealth". 
I respect the differences in the English language and was easily convinced in earlier MPAM work to accept different spelling. I now regret doing so because after merge we now have a supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant to encounter when reading the code and also complicates new work. > (Though either is acceptable, the fused word has a more informal feel > to it for me. Happy to change it -- though your rewording below gets > rid of it anyway. (This word doesn't appear in resctrl.rst -- > evertying is "percentage" etc.) > >> >>> If that's adequate, then the wording under the definition of >>> "bandwidth_gran" could be aligned with this. >> >> I think putting together a couple of your proposals and statements while making the >> text more accurate may work: >> >> "bandwidth_gran": >> The approximate granularity in which the memory bandwidth >> percentage is allocated. The allocated bandwidth percentage >> is rounded up to the next control step available on the >> hardware. The available hardware steps are no larger than >> this value. > > That's better, thanks. I'm happy to pick this up and reword the text > in both places along these lines. Thank you very much. Please do feel free to rework. > >> I assume "available" is needed because, even though the steps are not larger >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" >> to 100% range? > > Yes -- or, rather, the steps _look_ inconsistent because they are > rounded to exact percentages by the interface. > > I don't think we expect the actual steps in the hardware to be > irregular. > Thank you for clarifying. Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 15:38 ` Reinette Chatre @ 2025-09-29 16:10 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-09-29 16:10 UTC (permalink / raw) To: Reinette Chatre Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Mon, Sep 29, 2025 at 08:38:13AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/29/25 5:43 AM, Dave Martin wrote: [...] > > Generally, I agree, although I'm not sure whether that acronym belongs > > in the MPAM-specific code. > > > > For this patch, though, that's irrelevant. I've changed it to "MBA" > > as requested. > > > > Thank you. [...] > >> I find "worst-case precision" a bit confusing, consider for example, what > >> would "best-case precision" be? What do you think of "info/MB/bandwidth_gran gives > >> the upper limit of these interval steps"? I believe this matches what you > >> mentioned a couple of messages ago: "The available steps are no larger than this > >> value." > > > > Yes, that works. "Worst case" implies a value judgement that smaller > > steps are "better" then large steps, since the goal is control. > > > > But your wording, to the effect that this is the largest (apparent) > > step size, conveys all the needed information. > > Thank you for considering it. My preference is for stating things succinctly > and not leave too much for interpretation. I find that it's not always easy to work out what information is essential without the discussion... so I hope that didn't feel like a waste of time! > >> (and "per cent" -> "percent") > > > > ( Note: https://en.wiktionary.org/wiki/per_cent ) > > Yes, in particular I note the "chiefly Commonwealth". I respect the differences > in the English language and was easily convinced in earlier MPAM work to > accept different spelling. 
I now regret doing so because after merge we now have a > supposedly coherent resctrl codebase with inconsistent spelling that is unpleasant > to encounter when reading the code and also complicates new work. > > > (Though either is acceptable, the fused word has a more informal feel > > to it for me. Happy to change it -- though your rewording below gets > > rid of it anyway. (This word doesn't appear in resctrl.rst -- > > evertying is "percentage" etc.) Sure, it's best not to have mixed-up conventions in the same document. (With this one, I wasn't aware that there were regional differences at all, so that was news to me...) [...] > >> I think putting together a couple of your proposals and statements while making the > >> text more accurate may work: > >> > >> "bandwidth_gran": > >> The approximate granularity in which the memory bandwidth > >> percentage is allocated. The allocated bandwidth percentage > >> is rounded up to the next control step available on the > >> hardware. The available hardware steps are no larger than > >> this value. > > > > That's better, thanks. I'm happy to pick this up and reword the text > > in both places along these lines. > > Thank you very much. Please do feel free to rework. > > > > >> I assume "available" is needed because, even though the steps are not larger > >> than "bandwidth_gran", the steps may not be consistent across the "min_bandwidth" > >> to 100% range? > > > > Yes -- or, rather, the steps _look_ inconsistent because they are > > rounded to exact percentages by the interface. > > > > I don't think we expect the actual steps in the hardware to be > > irregular. > > > Thank you for clarifying. > > Reinette OK. I'll tidy up the loose ends and repost once I've had a chance to re-test. Thanks for the review. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-22 14:39 ` Dave Martin 2025-09-23 17:27 ` Reinette Chatre @ 2025-10-15 15:18 ` Dave Martin 2025-10-16 15:57 ` Reinette Chatre 1 sibling, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-10-15 15:18 UTC (permalink / raw) To: Reinette Chatre Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, Just following up on the skipped L2_NONCONT_CAT test -- see below. [...] On Mon, Sep 22, 2025 at 03:39:47PM +0100, Dave Martin wrote: [...] > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: [...] > > On 9/2/25 9:24 AM, Dave Martin wrote: [...] > > > Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ > > > the other tests except for the NONCONT_CAT tests, which do not seem to > > > be supported in my configuration -- and have nothing to do with the > > > code touched by this patch). > > > > Is the NONCONT_CAT test failing (i.e printing "not ok")? > > > > The NONCONT_CAT tests may print error messages as debug information as part of > > running, but these errors are expected as part of the test. The test should accurately > > state whether it passed or failed though. For example, below attempts to write > > a non-contiguous CBM to a system that does not support non-contiguous masks. > > This fails as expected, error messages printed as debugging and thus the test passes > > with an "ok". > > > > # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument > > # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected > > ok 5 L3_NONCONT_CAT: test > > I don't think that this was anything to do with my changes, but I don't > still seem to have the test output. (Since this test has to do with > bitmap schemata (?), it seemed unlikely to be affected by changes to > bw_validate().) 
> > I'll need to re-test with and without this patch to check whether it > makes any difference. I finally got around to testing this on top of -rc1. Disregarding trivial differences, the patched version (+++) doesn't seem to introduce any regressions over the vanilla version (---) (below). (The CMT test actually failed with an out-of-tolerance result on the vanilla kernel only. Possibly there was some adverse system load interfering.) Looking at the code, it seems that L2_NONCONT_CAT is not gated by any config or mount option. I think this is just a feature that my hardware doesn't support (?) arch/x86/kernel/cpu/resctrl/core.c has: | static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r) | { [...] | if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) | r->cache.arch_has_sparse_bitmasks = ecx.split.noncont; [...] | } Cheers ---Dave Full diff of the test output: --- base/resctrl_tests_6.18.0-rc1.out 2025-10-14 17:11:56.000000000 +0100 +++ test1/resctrl_tests_6.18.0-rc1-test1.out 2025-10-14 17:21:44.000000000 +0100 @@ -1,132 +1,132 @@ TAP version 13 # Pass: Check kernel supports resctrl filesystem # Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists # resctrl filesystem not mounted -# dmesg: [ 1.409003] resctrl: L3 allocation detected -# dmesg: [ 1.409040] resctrl: MB allocation detected -# dmesg: [ 1.409072] resctrl: L3 monitoring detected +# dmesg: [ 1.411733] resctrl: L3 allocation detected +# dmesg: [ 1.411792] resctrl: MB allocation detected +# dmesg: [ 1.411831] resctrl: L3 monitoring detected 1..6 # Starting MBM test ... # Mounting resctrl to "/sys/fs/resctrl" # Writing benchmark parameters to resctrl FS -# Benchmark PID: 5126 +# Benchmark PID: 4954 # Write schema "MB:0=100" to resctrl FS # Checking for pass/fail # Pass: Check MBM diff within 8% # avg_diff_per: 0% # Span (MB): 250 -# avg_bw_imc: 6422 -# avg_bw_resc: 6392 +# avg_bw_imc: 6886 +# avg_bw_resc: 6943 ok 1 MBM: test # Starting MBA test ... 
# Mounting resctrl to "/sys/fs/resctrl" # Writing benchmark parameters to resctrl FS -# Benchmark PID: 5129 +# Benchmark PID: 4957 # Write schema "MB:0=10" to resctrl FS # Write schema "MB:0=20" to resctrl FS # Write schema "MB:0=30" to resctrl FS # Write schema "MB:0=40" to resctrl FS # Write schema "MB:0=50" to resctrl FS # Write schema "MB:0=60" to resctrl FS # Write schema "MB:0=70" to resctrl FS # Write schema "MB:0=80" to resctrl FS # Write schema "MB:0=90" to resctrl FS # Write schema "MB:0=100" to resctrl FS # Results are displayed in (MB) # Pass: Check MBA diff within 8% for schemata 10 -# avg_diff_per: 1% -# avg_bw_imc: 2033 -# avg_bw_resc: 2012 +# avg_diff_per: 0% +# avg_bw_imc: 2028 +# avg_bw_resc: 2032 # Pass: Check MBA diff within 8% for schemata 20 # avg_diff_per: 0% -# avg_bw_imc: 3028 -# avg_bw_resc: 3005 +# avg_bw_imc: 3006 +# avg_bw_resc: 3011 # Pass: Check MBA diff within 8% for schemata 30 # avg_diff_per: 0% -# avg_bw_imc: 3982 -# avg_bw_resc: 3958 +# avg_bw_imc: 4006 +# avg_bw_resc: 4013 # Pass: Check MBA diff within 8% for schemata 40 # avg_diff_per: 0% -# avg_bw_imc: 6265 -# avg_bw_resc: 6236 +# avg_bw_imc: 6726 +# avg_bw_resc: 6732 # Pass: Check MBA diff within 8% for schemata 50 # avg_diff_per: 0% -# avg_bw_imc: 6384 -# avg_bw_resc: 6355 +# avg_bw_imc: 6854 +# avg_bw_resc: 6856 # Pass: Check MBA diff within 8% for schemata 60 # avg_diff_per: 0% -# avg_bw_imc: 6405 -# avg_bw_resc: 6376 +# avg_bw_imc: 6882 +# avg_bw_resc: 6883 # Pass: Check MBA diff within 8% for schemata 70 # avg_diff_per: 0% -# avg_bw_imc: 6417 -# avg_bw_resc: 6387 +# avg_bw_imc: 6891 +# avg_bw_resc: 6889 # Pass: Check MBA diff within 8% for schemata 80 # avg_diff_per: 0% -# avg_bw_imc: 6418 -# avg_bw_resc: 6394 +# avg_bw_imc: 6893 +# avg_bw_resc: 6909 # Pass: Check MBA diff within 8% for schemata 90 # avg_diff_per: 0% -# avg_bw_imc: 6412 -# avg_bw_resc: 6384 +# avg_bw_imc: 6890 +# avg_bw_resc: 6888 # Pass: Check MBA diff within 8% for schemata 100 # avg_diff_per: 0% -# 
avg_bw_imc: 6425 -# avg_bw_resc: 6399 +# avg_bw_imc: 6929 +# avg_bw_resc: 6951 # Pass: Check schemata change using MBA ok 2 MBA: test # Starting CMT test ... # Mounting resctrl to "/sys/fs/resctrl" # Cache size :23068672 # Writing benchmark parameters to resctrl FS -# Benchmark PID: 5135 +# Benchmark PID: 4970 # Checking for pass/fail -# Fail: Check cache miss rate within 15% -# Percent diff=24 +# Pass: Check cache miss rate within 15% +# Percent diff=4 # Number of bits: 5 -# Average LLC val: 7942963 +# Average LLC val: 10918297 # Cache span (bytes): 10485760 -not ok 3 CMT: test +ok 3 CMT: test # Starting L3_CAT test ... # Mounting resctrl to "/sys/fs/resctrl" # Cache size :23068672 # Writing benchmark parameters to resctrl FS # Write schema "L3:0=1f0" to resctrl FS # Write schema "L3:0=f" to resctrl FS # Write schema "L3:0=1f8" to resctrl FS # Write schema "L3:0=7" to resctrl FS # Write schema "L3:0=1fc" to resctrl FS # Write schema "L3:0=3" to resctrl FS # Write schema "L3:0=1fe" to resctrl FS # Write schema "L3:0=1" to resctrl FS # Checking for pass/fail # Number of bits: 4 -# Average LLC val: 71434 +# Average LLC val: 70161 # Cache span (lines): 131072 # Pass: Check cache miss rate changed more than 2.0% -# Percent diff=70.0 +# Percent diff=72.1 # Number of bits: 3 -# Average LLC val: 121463 +# Average LLC val: 120755 # Cache span (lines): 98304 # Pass: Check cache miss rate changed more than 1.0% -# Percent diff=40.8 +# Percent diff=42.5 # Number of bits: 2 -# Average LLC val: 170978 +# Average LLC val: 172077 # Cache span (lines): 65536 # Pass: Check cache miss rate changed more than 0.0% -# Percent diff=22.8 +# Percent diff=22.0 # Number of bits: 1 -# Average LLC val: 209950 +# Average LLC val: 209893 # Cache span (lines): 32768 ok 4 L3_CAT: test # Starting L3_NONCONT_CAT test ... 
# Mounting resctrl to "/sys/fs/resctrl" # Write schema "L3:0=3f" to resctrl FS # Write schema "L3:0=787" to resctrl FS # write() failed : Invalid argument # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected ok 5 L3_NONCONT_CAT: test # Starting L2_NONCONT_CAT test ... # Mounting resctrl to "/sys/fs/resctrl" ok 6 # SKIP Hardware does not support L2_NONCONT_CAT or L2_NONCONT_CAT is disabled # 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage. -# Totals: pass:4 fail:1 xfail:0 xpass:0 skip:1 error:0 +# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:1 error:0 --- ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-15 15:18 ` Dave Martin @ 2025-10-16 15:57 ` Reinette Chatre 2025-10-17 15:52 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-10-16 15:57 UTC (permalink / raw) To: Dave Martin Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 10/15/25 8:18 AM, Dave Martin wrote: > Hi Reinette, > > Just following up on the skipped L2_NONCONT_CAT test -- see below. Thank you very much. > > [...] > > On Mon, Sep 22, 2025 at 03:39:47PM +0100, Dave Martin wrote: > > [...] > >> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > > [...] > >>> On 9/2/25 9:24 AM, Dave Martin wrote: > > [...] > >>>> Testing: the resctrl MBA and MBM tests pass on a random x86 machine (+ >>>> the other tests except for the NONCONT_CAT tests, which do not seem to >>>> be supported in my configuration -- and have nothing to do with the >>>> code touched by this patch). >>> >>> Is the NONCONT_CAT test failing (i.e printing "not ok")? >>> >>> The NONCONT_CAT tests may print error messages as debug information as part of >>> running, but these errors are expected as part of the test. The test should accurately >>> state whether it passed or failed though. For example, below attempts to write >>> a non-contiguous CBM to a system that does not support non-contiguous masks. >>> This fails as expected, error messages printed as debugging and thus the test passes >>> with an "ok". >>> >>> # Write schema "L3:0=ff0ff" to resctrl FS # write() failed : Invalid argument >>> # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected >>> ok 5 L3_NONCONT_CAT: test >> >> I don't think that this was anything to do with my changes, but I don't >> still seem to have the test output. 
(Since this test has to do with >> bitmap schemata (?), it seemed unlikely to be affected by changes to >> bw_validate().) >> >> I'll need to re-test with and without this patch to check whether it >> makes any difference. > > I finally got around to testing this on top of -rc1. > > Disregarding trivial differences, the patched version (+++) doesn't > seem to introduce any regressions over the vanilla version (---) > (below). (The CMT test actually failed with an out-of-tolerance result > on the vanilla kernel only. Possibly there was some adverse system > load interfering.) My first thought is that this is another unfortunate consequence of the resctrl performance-as-functional tests. The percentage difference you encountered is quite large and that prompted me to take a closer look and it does look to me as though the CMT test can be improved. (Whether we should spend more effort on these performance tests instead of creating new deterministic functional tests is another topic.) > > > Looking at the code, it seems that L2_NONCONT_CAT is not gated by any > config or mount option. I think this is just a feature that my > hardware doesn't support (?) Yes, this is how I also interpret the test output. Focusing on the CMT test ... > # Starting CMT test ... > # Mounting resctrl to "/sys/fs/resctrl" > # Cache size :23068672 > # Writing benchmark parameters to resctrl FS > -# Benchmark PID: 5135 > +# Benchmark PID: 4970 > # Checking for pass/fail > -# Fail: Check cache miss rate within 15% > -# Percent diff=24 > +# Pass: Check cache miss rate within 15% > +# Percent diff=4 > # Number of bits: 5 > -# Average LLC val: 7942963 > +# Average LLC val: 10918297 > # Cache span (bytes): 10485760 > -not ok 3 CMT: test > +ok 3 CMT: test A 24% difference followed by a 4% difference is a big swing. On a high level the CMT test creates a new resource group with only the test assigned to it. The test initializes and accesses a buffer a couple of times while measuring cache occupancy. 
"success" is when the cache occupancy is within 15% of the buffer size. I noticed a couple of places where the test is susceptible to interference and system architecture. 1) The cache allocation of the test's resource group overlaps with the rest of the system. On a busy system it is thus likely that the test's cache entries may be evicted. 2) The test does not account for cache architecture where, for example, there may be an L2 cache that can accommodate a large part of the buffer and thus not be reflected in the LLC occupancy count. I started experimenting to see what it will take to reduce interference and ended up with a change like below that isolates the cache portions between the test and the rest of the system and, if L2 cache allocation is possible, reduces the amount of L2 cache the test can allocate into as much as possible. This opened up another tangent where the size of the cache portion is the same as the buffer while it is not realistic to expect a user space buffer to fill into the cache so nicely. Even with these changes I was not able to get the percentages to drop significantly on my system but it may help to reduce the swings in numbers observed. But, I do not see how work like this helps to improve resctrl health (compared to, for example, just increasing the "success" percentage). 
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c index d09e693dc739..494e98aa8b69 100644 --- a/tools/testing/selftests/resctrl/cmt_test.c +++ b/tools/testing/selftests/resctrl/cmt_test.c @@ -19,12 +19,22 @@ #define CON_MON_LCC_OCCUP_PATH \ "%s/%s/mon_data/mon_L3_%02d/llc_occupancy" -static int cmt_init(const struct resctrl_val_param *param, int domain_id) +static int cmt_init(const struct resctrl_test *test, + const struct user_params *uparams, + const struct resctrl_val_param *param, int domain_id) { + char schemata[64]; + int ret; + sprintf(llc_occup_path, CON_MON_LCC_OCCUP_PATH, RESCTRL_PATH, param->ctrlgrp, domain_id); - return 0; + snprintf(schemata, sizeof(schemata), "%lx", param->mask); + ret = write_schemata(param->ctrlgrp, schemata, uparams->cpu, test->resource); + if (!ret && !strcmp(test->resource, "L3") && resctrl_resource_exists("L2")) + ret = write_schemata(param->ctrlgrp, "0x1", uparams->cpu, "L2"); + + return ret; } static int cmt_setup(const struct resctrl_test *test, @@ -119,6 +129,7 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param unsigned long cache_total_size = 0; int n = uparams->bits ? 
: 5; unsigned long long_mask; + char schemata[64]; int count_of_bits; size_t span; int ret; @@ -162,6 +173,11 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param param.fill_buf = &fill_buf; } + snprintf(schemata, sizeof(schemata), "%lx", ~param.mask & long_mask); + ret = write_schemata("", schemata, uparams->cpu, test->resource); + if (ret) + return ret; + remove(RESULT_FILE_NAME); ret = resctrl_val(test, uparams, ¶m); diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index c7e9adc0368f..cd4c715b7ffd 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -17,7 +17,9 @@ #define ALLOCATION_MIN 10 #define ALLOCATION_STEP 10 -static int mba_init(const struct resctrl_val_param *param, int domain_id) +static int mba_init(const struct resctrl_test *test, + const struct user_params *uparams, + const struct resctrl_val_param *param, int domain_id) { int ret; diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c index 84d8bc250539..58201f844740 100644 --- a/tools/testing/selftests/resctrl/mbm_test.c +++ b/tools/testing/selftests/resctrl/mbm_test.c @@ -83,7 +83,9 @@ static int check_results(size_t span) return ret; } -static int mbm_init(const struct resctrl_val_param *param, int domain_id) +static int mbm_init(const struct resctrl_test *test, + const struct user_params *uparams, + const struct resctrl_val_param *param, int domain_id) { int ret; diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index cd3adfc14969..9853bd746392 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -133,7 +133,9 @@ struct resctrl_val_param { char filename[64]; unsigned long mask; int num_of_runs; - int (*init)(const struct resctrl_val_param *param, + int (*init)(const struct resctrl_test *test, + const struct 
user_params *uparams, + const struct resctrl_val_param *param, int domain_id); int (*setup)(const struct resctrl_test *test, const struct user_params *uparams, diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c index 7c08e936572d..a5a8badb83d4 100644 --- a/tools/testing/selftests/resctrl/resctrl_val.c +++ b/tools/testing/selftests/resctrl/resctrl_val.c @@ -569,7 +569,7 @@ int resctrl_val(const struct resctrl_test *test, goto reset_affinity; if (param->init) { - ret = param->init(param, domain_id); + ret = param->init(test, uparams, param, domain_id); if (ret) goto reset_affinity; } ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-16 15:57 ` Reinette Chatre @ 2025-10-17 15:52 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-10-17 15:52 UTC (permalink / raw) To: Reinette Chatre Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi there, On Thu, Oct 16, 2025 at 08:57:59AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 10/15/25 8:18 AM, Dave Martin wrote: [...] > > I finally got around to testing this on top of -rc1. > > > > Disregarding trivial differences, the patched version (+++) doesn't > > seem to introduce any regressions over the vanilla version (---) > > (below). (The CMT test actually failed with an out-of-tolerance result > > on the vanilla kernel only. Possibly there was some adverse system > > load interfering.) > > My first thought is that this is another unfortunate consequence of the resctrl > performance-as-functional tests. > The percentage difference you encountered is quite large and that > prompted me to take a closer look and it does look to me as though the CMT > can be improved. (Whether we should spend more effort on these performance tests > instead of creating new deterministic functional tests is another topic.) I ran the tests soon after booting a full-fat OS, which may not have helped. In an ideal world we sort of want two RDT instances, one under test, and one outside it to isolate it from the rest of the system... but that would require extra hardware :( I'll aim to just boot to a shell the next time I run these tests. > > Looking at the code, it seems that L2_NONCONT_CAT is not gated by any > > config or mount option. I think this is just a feature that my > > hardware doesn't support (?) > > Yes, this is how I also interpret the test output. > > Focusing on the CMT test ... > > > # Starting CMT test ... 
> > # Mounting resctrl to "/sys/fs/resctrl" > > # Cache size :23068672 > > # Writing benchmark parameters to resctrl FS > > -# Benchmark PID: 5135 > > +# Benchmark PID: 4970 > > # Checking for pass/fail > > -# Fail: Check cache miss rate within 15% > > -# Percent diff=24 > > +# Pass: Check cache miss rate within 15% > > +# Percent diff=4 > > # Number of bits: 5 > > -# Average LLC val: 7942963 > > +# Average LLC val: 10918297 > > # Cache span (bytes): 10485760 > > -not ok 3 CMT: test > > +ok 3 CMT: test > > A 24% difference followed by a 4% difference is a big swing. On a high level > the CMT test creates a new resource group with only the test assigned to it. The test > initializes and accesses a buffer a couple of time while measuring cache occupancy. > "success" is when the cache occupancy is within 15% of the buffer size. > > I noticed a couple of places where the test is susceptible to interference and > system architecture. > 1) The cache allocation of test's resource group overlaps with the rest of the > system. On a busy system it is thus likely that the test's cache entries may be > evicted. > 2) The test does not account for cache architecture where, for example, there may be > an L2 cache that can accommodate a large part of the buffer and thus not be > reflected in the LLC occupancy count. I suppose the test could take all the cache sizes into account, but this could get complicated -- for a statistical test, it may not be worth it. Can we probe it just by setting the CAT mask to all ones (maybe excluding one bit to give to the default control group) and then flooding with data until the LLC usage stops increasing? > I started experimenting to see what it will take to reduce interference and ended up > with a change like below that isolates the cache portions between the test and the > rest of the system and if L2 cache allocation is possible, reduces the amount of L2 > cache the test can allocate into as much as possible. 
This opened up another tangent > where the size of cache portion is the same as the buffer while it is not realistic > to expect a user space buffer to fill into the cache so nicely. > > Even with these changes I was not able to get the percentages to drop significantly > on my system but it may help to reduce the swings in numbers observed. > > But, I do not see how work like this helps to improve resctrl health (compared to, > for example, just increasing the "success" percentage). If I understand correctly, this programs the default control group with the inverse of the test's CAT mask, so that there is no overlap? That seems reasonable, if so. I wonder whether bandwidth contention is also having an effect. Would programming the default control and test control group with MB values that don't add up to more than 100% help? Cheers ---Dave > diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c > index d09e693dc739..494e98aa8b69 100644 > --- a/tools/testing/selftests/resctrl/cmt_test.c > +++ b/tools/testing/selftests/resctrl/cmt_test.c > @@ -19,12 +19,22 @@ > #define CON_MON_LCC_OCCUP_PATH \ > "%s/%s/mon_data/mon_L3_%02d/llc_occupancy" > > -static int cmt_init(const struct resctrl_val_param *param, int domain_id) > +static int cmt_init(const struct resctrl_test *test, > + const struct user_params *uparams, > + const struct resctrl_val_param *param, int domain_id) > { > + char schemata[64]; > + int ret; > + > sprintf(llc_occup_path, CON_MON_LCC_OCCUP_PATH, RESCTRL_PATH, > param->ctrlgrp, domain_id); > > - return 0; > + snprintf(schemata, sizeof(schemata), "%lx", param->mask); > + ret = write_schemata(param->ctrlgrp, schemata, uparams->cpu, test->resource); > + if (!ret && !strcmp(test->resource, "L3") && resctrl_resource_exists("L2")) > + ret = write_schemata(param->ctrlgrp, "0x1", uparams->cpu, "L2"); > + > + return ret; > } [...] 
> @@ -162,6 +173,11 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param [...] > + snprintf(schemata, sizeof(schemata), "%lx", ~param.mask & long_mask); > + ret = write_schemata("", schemata, uparams->cpu, test->resource); > + if (ret) > + return ret; > + > remove(RESULT_FILE_NAME); > > ret = resctrl_val(test, uparams, &param); > diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c [...] ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-02 16:24 [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch Dave Martin 2025-09-12 22:19 ` Reinette Chatre @ 2025-09-22 15:04 ` Dave Martin 2025-09-25 22:58 ` Luck, Tony 2025-09-26 20:54 ` Reinette Chatre 1 sibling, 2 replies; 52+ messages in thread From: Dave Martin @ 2025-09-22 15:04 UTC (permalink / raw) To: linux-kernel Cc: Tony Luck, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi again, On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: [...] > > Clamping to bw_min and bw_max still feels generic: leave it in the core > > code, for now. > > Sounds like MPAM may be ready to start the schema parsing discussion again? > I understand that MPAM has a few more ways to describe memory bandwidth as > well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > schema format to user space, which seems like a good idea for new schema. On this topic, specifically: My own ideas in this area are a little different, though I agree with the general idea. Bitmap controls are distinct from numeric values, but for numbers, I'm not sure that distinguishing percentages from other values is required, since this is really just a specific case of a linear scale. 
I imagined a generic numeric schema, described by a set of files like the following in a schema's info directory: min: minimum value, e.g., 1 max: maximum value, e.g., 1023 scale: value that corresponds to one unit unit: quantified base unit, e.g., "100pc", "64MBps" map: mapping function name If s is the value written in a schemata entry and p is the corresponding physical amount of resource, then min <= s <= max and p = map(s / scale) * unit One reason why I prefer this scaling scheme over the floating-point approach is that it can be exact (at least for currently known platforms), and it doesn't require a new floating-point parser/ formatter to be written for this one thing in the kernel (which I suspect is likely to be error-prone and poorly defined around subtleties such as rounding behaviour). "map" anticipates non-linear ramps, but this is only really here as a forwards compatibility get-out. For now, this might just be set to "none", meaning the identity mapping (i.e., a no-op). This may shadow the existing "delay_linear" parameter, but with more general applicability if we need it. The idea is that userspace reads the info files and then does the appropriate conversions itself. This might or might not be seen as a burden, but would give exact control over the hardware configuration with a generic interface, with possibly greater precision than the existing schemata allow (when the hardware supports it), and without having to second-guess the rounding that the kernel may or may not do on the values. For RDT MBA, we might have min: 10 max: 100 scale: 100 unit: 100pc map: none The schemata entry MB: 0=10, 1=100 would allocate the minimum possible bandwidth to domain 0, and 100% bandwidth to domain 1. For AMD SMBA, we might have: min: 1 max: 100 scale: 8 unit: 1GBps (if I've understood this correctly from resctrl.rst.) 
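To make the model above concrete, here is a small userspace-side sketch of the conversion, using exact rational arithmetic to match the "no floating point" argument. The parameter values are taken from the hypothetical RDT MBA and MPAM examples in this mail; none of this is an existing kernel interface:

```python
from fractions import Fraction

def schemata_to_physical(s, min_val, max_val, scale, unit_amount, map_fn=None):
    """Convert a schemata value s to a physical amount of resource,
    following p = map(s / scale) * unit from the proposed model."""
    if not (min_val <= s <= max_val):
        raise ValueError(f"value {s} outside [{min_val}, {max_val}]")
    x = Fraction(s, scale)      # exact, no floating-point rounding
    if map_fn is not None:      # "map: none" means the identity mapping
        x = map_fn(x)
    return x * unit_amount

# Hypothetical RDT MBA info files: min 10, max 100, scale 100, unit 100pc
pct = schemata_to_physical(10, 10, 100, 100, 100)    # -> 10 (percent)

# Hypothetical MPAM MBW_MAX with 6-bit resolution: min 1, max 64, scale 64
pct_mpam = schemata_to_physical(64, 1, 64, 64, 100)  # -> 100 (percent)
```

The point of doing this in userspace is that the tool, not the kernel, decides how to round when the desired bandwidth does not land exactly on a hardware step.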
For MPAM MBW_MAX with, say, 6 bits of resolution, we might have: min: 1 max: 64 scale: 64 unit: 100pc map: none The schemata entry MB: 0=1,1=64 would allocate the minimum possible bandwidth to domain 0, and 100% bandwidth to domain 1. This would probably need to be a new schema, since we already have "MB" mimicking x86. Exposing the hardware scale in this way would give userspace precise control (including in sub-1% increments on capable hardware), without having to second-guess the way the kernel will round the values. > Is this something MPAM is still considering? For example, the minimum > and maximum ranges that can be specified, is this something you already > have some ideas for? Have you perhaps considered Tony's RFD [3] that includes > discussion on how to handle min/max ranges for bandwidth? This seems to be a different thing. I think James had some thoughts on this already -- I haven't checked on his current idea, but one option would be simply to expose this as two distinct schemata, say MB_MIN, MB_MAX. There's a question of how to cope with multiple different schemata entries that shadow each other (i.e., control the same hardware resource). Would something like the following work? A read from schemata might produce something like this: MB: 0=50, 1=50 # MB_HW: 0=32, 1=32 # MB_MIN: 0=31, 1=31 # MB_MAX: 0=32, 1=32 (Where MB_HW is the MPAM schema with 6-bit resolution that I illustrated above, and MB_MIN and MB_MAX are similar schemata for the specific MIN and MAX controls in the hardware.) Userspace that does not understand the new entries would need to ignore the commented lines, but can otherwise safely alter and write back the schemata with the expected results. The kernel would in turn ignore the commented lines on write. The commented lines are meaningful but "inactive": they describe the current hardware configuration on read, but (unless explicitly uncommented) won't change anything on write. 
Software that understands the new entries can uncomment the conflicting entries and write them back instead of (or in addition to) the conflicting entries. For example, userspace might write the following: MB_MIN: 0=16, 1=16 MB_MAX: 0=32, 1=32 Which might then read back as follows: MB: 0=50, 1=50 # MB_HW: 0=32, 1=32 # MB_MIN: 0=16, 1=16 # MB_MAX: 0=32, 1=32 I haven't tried to develop this idea further, for now. I'd be interested in people's thoughts on it, though. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
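A rough illustration of how a new-format-aware userspace tool might treat the commented lines in this proposal (the schema names and the leading-'#' convention are hypothetical, as is the whole interface):

```python
def parse_schemata(text):
    """Split a schemata dump into active and inactive (commented) entries.
    Commented lines describe hardware state on read and would be ignored
    on write unless explicitly uncommented."""
    active, inactive = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        target = inactive if line.startswith("#") else active
        name, _, rest = line.lstrip("# ").partition(":")
        target[name.strip()] = dict(
            item.split("=") for item in rest.replace(",", " ").split()
        )
    return active, inactive

dump = """\
MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32
"""
active, inactive = parse_schemata(dump)
# active  == {'MB': {'0': '50', '1': '50'}}
# inactive holds MB_HW, MB_MIN and MB_MAX
```

A legacy tool that read, edited and wrote back only the uncommented lines would behave exactly as it does today, which is the compatibility property the proposal is after.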
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-22 15:04 ` Dave Martin @ 2025-09-25 22:58 ` Luck, Tony 2025-09-29 9:19 ` Chen, Yu C 2025-09-29 13:56 ` Dave Martin 2025-09-26 20:54 ` Reinette Chatre 1 sibling, 2 replies; 52+ messages in thread From: Luck, Tony @ 2025-09-25 22:58 UTC (permalink / raw) To: Dave Martin Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: > Hi again, > > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > > [...] > > > > Clamping to bw_min and bw_max still feels generic: leave it in the core > > > code, for now. > > > > Sounds like MPAM may be ready to start the schema parsing discussion again? > > I understand that MPAM has a few more ways to describe memory bandwidth as > > well as cache portion partitioning. Previously ([1] [2]) James mused about exposing > > schema format to user space, which seems like a good idea for new schema. > > On this topic, specifically: > > > My own ideas in this area are a little different, though I agree with > the general idea. > > Bitmap controls are distinct from numeric values, but for numbers, I'm > not sure that distinguishing percentages from other values is required, > since this is really just a specific case of a linear scale. 
> > I imagined a generic numeric schema, described by a set of files like > the following in a schema's info directory: > > min: minimum value, e.g., 1 > max: maximum value, e.g., 1023 > scale: value that corresponds to one unit > unit: quantified base unit, e.g., "100pc", "64MBps" > map: mapping function name > > If s is the value written in a schemata entry and p is the > corresponding physical amount of resource, then > > min <= s <= max > > and > > p = map(s / scale) * unit > > One reason why I prefer this scaling scheme over the floating-point > approach is that it can be exact (at least for currently known > platforms), and it doesn't require a new floating-point parser/ > formatter to be written for this one thing in the kernel (which I > suspect is likely to be error-prone and poorly defined around > subtleties such as rounding behaviour). > > "map" anticipates non-linear ramps, but this is only really here as a > forwards compatibility get-out. For now, this might just be set to > "none", meaning the identity mapping (i.e., a no-op). This may shadow > the existing "delay_linear" parameter, but with more general > applicability if we need it. > > > The idea is that userspace reads the info files and then does the > appropriate conversions itself. This might or might not be seen as a > burden, but would give exact control over the hardware configuration > with a generic interface, with possibly greater precision than the > existing schemata allow (when the hardware supports it), and without > having to second-guess the rounding that the kernel may or may not do > on the values. > > For RDT MBA, we might have > > min: 10 > max: 100 > scale: 100 > unit: 100pc > map: none > > The schemata entry > > MB: 0=10, 1=100 > > would allocate the minimum possible bandwidth to domain 0, and 100% > bandwidth to domain 1. > > > For AMD SMBA, we might have: > > min: 1 > max: 100 > scale: 8 > unit: 1GBps > > (if I've understood this correctly from resctrl.rst.) 
> > > For MPAM MBW_MAX with, say, 6 bits of resolution, we might have: > > min: 1 > max: 64 > scale: 64 > unit: 100pc > map: none > > The schemata entry > > MB: 0=1,1=64 > > would allocate the minimum possible bandwidth to domain 0, and 100% > bandwidth to domain 1. This would probably need to be a new schema, > since we already have "MB" mimicking x86. > > Exposing the hardware scale in this way would give userspace precise > control (including in sub-1% increments on capable hardware), without > having to second-guess the way the kernel will round the values. > > > > Is this something MPAM is still considering? For example, the minimum > > and maximum ranges that can be specified, is this something you already > > have some ideas for? Have you perhaps considered Tony's RFD [3] that includes > > discussion on how to handle min/max ranges for bandwidth? > > This seems to be a different thing. I think James had some thoughts on > this already -- I haven't checked on his current idea, but one option > would be simply to expose this as two distinct schemata, say MB_MIN, > MB_MAX. > > There's a question of how to cope with multiple different schemata > entries that shadow each other (i.e., control the same hardware > resource). > > > Would something like the following work? A read from schemata might > produce something like this: > > MB: 0=50, 1=50 > # MB_HW: 0=32, 1=32 > # MB_MIN: 0=31, 1=31 > # MB_MAX: 0=32, 1=32 > > (Where MB_HW is the MPAM schema with 6-bit resolution that I > illustrated above, and MB_MIN and MB_MAX are similar schemata for the > specific MIN and MAX controls in the hardware.) > > Userspace that does not understand the new entries would need to ignore > the commented lines, but can otherwise safely alter and write back the > schemata with the expected results. The kernel would in turn ignore > the commented lines on write. 
The commented lines are meaningful but > "inactive": they describe the current hardware configuration on read, > but (unless explicitly uncommented) won't change anything on write. > > Software that understands the new entries can uncomment the conflicting > entries and write them back instead of (or in addition to) the > conflicting entries. For example, userspace might write the following: > > MB_MIN: 0=16, 1=16 > MB_MAX: 0=32, 1=32 > > Which might then read back as follows: > > MB: 0=50, 1=50 > # MB_HW: 0=32, 1=32 > # MB_MIN: 0=16, 1=16 > # MB_MAX: 0=32, 1=32 > > > I haven't tried to develop this idea further, for now. > > I'd be interested in people's thoughts on it, though. Applying this to Intel upcoming region aware memory bandwidth that supports 255 steps and h/w min/max limits. We would have info files with "min = 1, max = 255" and a schemata file that looks like this to legacy apps: MB: 0=50;1=75 #MB_HW: 0=128;1=191 #MB_MIN: 0=128;1=191 #MB_MAX: 0=128;1=191 But a newer app that is aware of the extensions can write: # cat > schemata << 'EOF' MB_HW: 0=10 MB_MIN: 0=10 MB_MAX: 0=64 EOF which then reads back as: MB: 0=4;1=75 #MB_HW: 0=10;1=191 #MB_MIN: 0=10;1=191 #MB_MAX: 0=64;1=191 with the legacy line updated with the rounded value of the MB_HW supplied by the user. 10/255 = 3.921% ... so call it "4". The region aware h/w supports separate bandwidth controls for each region. We could hope (or perhaps update the spec to define) that region0 is always node-local DDR memory and keep the "MB" tag for that. Then use some other tag naming for other regions. Remote DDR, local CXL, remote CXL are the ones we think are next in the h/w memory sequence. But the "region" concept would allow for other options as other memory technologies come into use. > > Cheers > ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
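The legacy-line rounding in Tony's example can be checked with a few lines of arithmetic. The Q value of 255 meaning "no throttling" and round-to-nearest behaviour are assumptions read off the example, not a defined interface:

```python
def hw_to_legacy_pct(hw_value, q=255):
    """Round a hardware bandwidth step to the nearest legacy percentage,
    assuming q corresponds to no throttling (100% bandwidth)."""
    return round(hw_value * 100 / q)

print(hw_to_legacy_pct(10))   # 10/255 = 3.921% -> 4, as in the example
print(hw_to_legacy_pct(191))  # 191/255 = 74.9% -> 75, matching "1=75"
```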
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-25 22:58 ` Luck, Tony @ 2025-09-29 9:19 ` Chen, Yu C 2025-09-29 14:13 ` Dave Martin 2025-09-29 13:56 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Chen, Yu C @ 2025-09-29 9:19 UTC (permalink / raw) To: Luck, Tony Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc, Dave Martin On 9/26/2025 6:58 AM, Luck, Tony wrote: > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: >> Hi again, >> >> On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: >> >> [...] >> [snip] >> For example, userspace might write the following: >> >> MB_MIN: 0=16, 1=16 >> MB_MAX: 0=32, 1=32 >> >> Which might then read back as follows: >> >> MB: 0=50, 1=50 >> # MB_HW: 0=32, 1=32 >> # MB_MIN: 0=16, 1=16 >> # MB_MAX: 0=32, 1=32 >> >> >> I haven't tried to develop this idea further, for now. >> >> I'd be interested in people's thoughts on it, though. > > Applying this to Intel upcoming region aware memory bandwidth > that supports 255 steps and h/w min/max limits. > We would have info files with "min = 1, max = 255" and a schemata > file that looks like this to legacy apps: > > MB: 0=50;1=75 > #MB_HW: 0=128;1=191 > #MB_MIN: 0=128;1=191 > #MB_MAX: 0=128;1=191 > > But a newer app that is aware of the extensions can write: > > # cat > schemata << 'EOF' > MB_HW: 0=10 > MB_MIN: 0=10 > MB_MAX: 0=64 > EOF > > which then reads back as: > MB: 0=4;1=75 > #MB_HW: 0=10;1=191 > #MB_MIN: 0=10;1=191 > #MB_MAX: 0=64;1=191 > > with the legacy line updated with the rounded value of the MB_HW > supplied by the user. 10/255 = 3.921% ... so call it "4". > This seems to be applicable as it introduces the new interface while preserving forward compatibility. One minor question is that, according to "Figure 6-5. 
MBA Optimal Bandwidth Register" in the latest RDT specification, the maximum value ranges from 1 to 511. Additionally, this bandwidth field is located at bits 48 to 56 in the MBA Optimal Bandwidth Register, and the range for this segment could be 1 to 8191. Just wonder if it would be possible that the current maximum value of 512 may be extended in the future? Perhaps we could explore a method to query the maximum upper limit from the ACPI table or register, or use CPUID to distinguish between platforms rather than hardcoding it. Reinette also mentioned this in another thread. Thanks, Chenyu [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html > The region aware h/w supports separate bandwidth controls for each > region. We could hope (or perhaps update the spec to define) that > region0 is always node-local DDR memory and keep the "MB" tag for > that. > > Then use some other tag naming for other regions. Remote DDR, > local CXL, remote CXL are the ones we think are next in the h/w > memory sequence. But the "region" concept would allow for other > options as other memory technologies come into use. > >> >> Cheers >> ---Dave > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 9:19 ` Chen, Yu C @ 2025-09-29 14:13 ` Dave Martin 2025-09-29 16:23 ` Luck, Tony 2025-09-30 4:43 ` Chen, Yu C 0 siblings, 2 replies; 52+ messages in thread From: Dave Martin @ 2025-09-29 14:13 UTC (permalink / raw) To: Chen, Yu C Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi there, On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote: > On 9/26/2025 6:58 AM, Luck, Tony wrote: [...] > > Applying this to Intel upcoming region aware memory bandwidth > > that supports 255 steps and h/w min/max limits. > > We would have info files with "min = 1, max = 255" and a schemata > > file that looks like this to legacy apps: > > > > MB: 0=50;1=75 > > #MB_HW: 0=128;1=191 > > #MB_MIN: 0=128;1=191 > > #MB_MAX: 0=128;1=191 > > > > But a newer app that is aware of the extensions can write: > > > > # cat > schemata << 'EOF' > > MB_HW: 0=10 > > MB_MIN: 0=10 > > MB_MAX: 0=64 > > EOF > > > > which then reads back as: > > MB: 0=4;1=75 > > #MB_HW: 0=10;1=191 > > #MB_MIN: 0=10;1=191 > > #MB_MAX: 0=64;1=191 > > > > with the legacy line updated with the rounded value of the MB_HW > > supplied by the user. 10/255 = 3.921% ... so call it "4". > > > > This seems to be applicable as it introduces the new interface > while preserving forward compatibility. > > One minor question is that, according to "Figure 6-5. MBA Optimal > Bandwidth Register" in the latest RDT specification, the maximum > value ranges from 1 to 511. > Additionally, this bandwidth field is located at bits 48 to 56 in > the MBA Optimal Bandwidth Register, and the range for > this segment could be 1 to 8191. Just wonder if it would be > possible that the current maximum value of 512 may be extended > in the future? 
Perhaps we could explore a method to query the maximum upper > limit from the ACPI table or register, or use CPUID to distinguish between > platforms rather than hardcoding it. Reinette also mentioned this in another > thread. > > Thanks, > Chenyu > > > [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html I can't comment on the direction of travel in the RDT architecture. I guess it would be up to the arch code whether to trust ACPI if it says that the maximum value of this field is > 511. (> 65535 would be impossible though, since the fields would start to overlap each other...) Would anything break in the interface proposed here, if the maximum value is larger than 511? (I'm hoping not.) For MPAM, the bandwidth controls can have up to 16 bits and the size can be probed through MMIO registers. I don't think we've seen MPAM hardware that comes close to 16 bits for now, though. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 14:13 ` Dave Martin @ 2025-09-29 16:23 ` Luck, Tony 2025-09-30 11:02 ` Chen, Yu C 2025-09-30 4:43 ` Chen, Yu C 1 sibling, 1 reply; 52+ messages in thread From: Luck, Tony @ 2025-09-29 16:23 UTC (permalink / raw) To: Dave Martin, Chen, Yu C Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86@kernel.org, linux-doc@vger.kernel.org > > This seems to be applicable as it introduces the new interface > > while preserving forward compatibility. > > > > One minor question is that, according to "Figure 6-5. MBA Optimal > > Bandwidth Register" in the latest RDT specification, the maximum > > value ranges from 1 to 511. > > Additionally, this bandwidth field is located at bits 48 to 56 in > > the MBA Optimal Bandwidth Register, and the range for > > this segment could be 1 to 8191. Just wonder if it would be 48..56 is still 9 bits, so max value is 511. > > possible that the current maximum value of 512 may be extended > > in the future? Perhaps we could explore a method to query the maximum upper > > limit from the ACPI table or register, or use CPUID to distinguish between > > platforms rather than hardcoding it. Reinette also mentioned this in another > > thread. I think 511 was chosen as "bigger than we expect to ever need" and 9-bits allocated in the registers based on that. Initial implementation may use 255 as the maximum - though I'm pushing on that a bit as the throttle graph at the early stage is fairly linear from "1" to some value < 255, when bandwidth hits maximum, then flat up to 255. If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI table should be the value where peak bandwidth is hit (though this is complicated because workloads with different mixes of read/write access have different throttle graphs). 
> > > > Thanks, > > Chenyu > > > > > > [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html > > I can't comment on the direction of travel in the RDT architecture. > > I guess it would be up to the arch code whether to trust ACPI if it > says that the maximum value of this field is > 511. (> 65535 would be > impossible though, since the fields would start to overlap each > other...) resctrl should do some sanity checks on values it sees in the ACPI tables. Linux has: #define FW_BUG "[Firmware Bug]: " #define FW_WARN "[Firmware Warn]: " #define FW_INFO "[Firmware Info]: " for good historical reasons. > > Would anything break in the interface proposed here, if the maximum > value is larger than 511? (I'm hoping not.) For MPAM, the bandwidth > controls can have up to 16 bits and the size can be probed through MMIO > registers. > > I don't think we've seen MPAM hardware that comes close to 16 bits for > now, though. While kernel code is sometimes space-conserving and uses u8/u16 types for values that fit in some limited range, I'd expect user applications that read the "info" files and program the "schemata" files to not care. Python integers have arbitrary precision, so would be just fine with: max 340282366920938463463374607431768211455 :-) -Tony ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 16:23 ` Luck, Tony @ 2025-09-30 11:02 ` Chen, Yu C 2025-09-30 16:08 ` Luck, Tony 0 siblings, 1 reply; 52+ messages in thread From: Chen, Yu C @ 2025-09-30 11:02 UTC (permalink / raw) To: Luck, Tony Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86@kernel.org, linux-doc@vger.kernel.org, Dave Martin On 9/30/2025 12:23 AM, Luck, Tony wrote: >>> This seems to be applicable as it introduces the new interface >>> while preserving forward compatibility. >>> >>> One minor question is that, according to "Figure 6-5. MBA Optimal >>> Bandwidth Register" in the latest RDT specification, the maximum >>> value ranges from 1 to 511. >>> Additionally, this bandwidth field is located at bits 48 to 56 in >>> the MBA Optimal Bandwidth Register, and the range for >>> this segment could be 1 to 8191. Just wonder if it would be > > 48..56 is still 9 bits, so max value is 511. > Ah I see, I overlooked this. >>> possible that the current maximum value of 512 may be extended >>> in the future? Perhaps we could explore a method to query the maximum upper >>> limit from the ACPI table or register, or use CPUID to distinguish between >>> platforms rather than hardcoding it. Reinette also mentioned this in another >>> thread. > > I think 511 was chosen as "bigger than we expect to ever need" and 9-bits > allocated in the registers based on that. > OK, got it. > Initial implementation may use 255 as the maximum - though I'm pushing on > that a bit as the throttle graph at the early stage is fairly linear from "1" to some > value < 255, > when bandwidth hits maximum, then flat up to 255. > If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI > table should be the value where peak bandwidth is hit I see. 
If I understand correctly, the BIOS needs to pre-train the system to find this Q. However, if the BIOS cannot provide this Q, would it be feasible for the user to provide it? For example, the user could saturate the memory bandwidth, gradually increase MB_MAX, and finally find the Q_max where the memory bandwidth no longer increases. The user could then adjust the max field in the info file. > (though this is complicated > because workloads with different mixes of read/write access have different > throttle graphs). > Does this mean read and write operations have different Q values to saturate the memory bandwidth? For example, if the workload is all reads, there is a Q_r; if the workload is all writes, there is another Q_w. In that case, maybe we could choose the maximum of Q_r and Q_w (max(Q_r, Q_w)). thanks, Chenyu ^ permalink raw reply [flat|nested] 52+ messages in thread
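The probing loop Chen describes could look roughly like this in userspace. Everything here is hypothetical: `set_mb_max()` and `measure_bandwidth()` stand in for writing the schemata file and sampling MBM counters under a saturating workload, and the plateau test is deliberately naive:

```python
def find_q_max(set_mb_max, measure_bandwidth, q_limit=255):
    """Sweep MB_MAX upward under a saturating workload and return the
    last step at which measured bandwidth still increased (Q_max)."""
    prev_bw = 0.0
    for q in range(1, q_limit + 1):
        set_mb_max(q)
        bw = measure_bandwidth()
        if prev_bw and bw <= prev_bw:   # plateau: no further increase
            return q - 1
        prev_bw = bw
    return q_limit
```

A real tool would additionally need to pin the workload, let things settle between steps, and average several noisy samples per setting (with some tolerance) rather than comparing single measurements.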
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-30 11:02 ` Chen, Yu C @ 2025-09-30 16:08 ` Luck, Tony 0 siblings, 0 replies; 52+ messages in thread From: Luck, Tony @ 2025-09-30 16:08 UTC (permalink / raw) To: Chen, Yu C Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86@kernel.org, linux-doc@vger.kernel.org, Dave Martin > >>> This seems to be applicable as it introduces the new interface > >>> while preserving forward compatibility. > >>> > >>> One minor question is that, according to "Figure 6-5. MBA Optimal > >>> Bandwidth Register" in the latest RDT specification, the maximum > >>> value ranges from 1 to 511. > >>> Additionally, this bandwidth field is located at bits 48 to 56 in > >>> the MBA Optimal Bandwidth Register, and the range for > >>> this segment could be 1 to 8191. Just wonder if it would be > > > > 48..56 is still 9 bits, so max value is 511. > > > > Ah I see, I overlooked this. > > >>> possible that the current maximum value of 512 may be extended > >>> in the future? Perhaps we could explore a method to query the maximum upper > >>> limit from the ACPI table or register, or use CPUID to distinguish between > >>> platforms rather than hardcoding it. Reinette also mentioned this in another > >>> thread. > > > > I think 511 was chosen as "bigger than we expect to ever need" and 9-bits > > allocated in the registers based on that. > > > > OK, got it. > > > Initial implementation may use 255 as the maximum - though I'm pushing on > > that a bit as the throttle graph at the early stage is fairly linear from "1" to some > > value < 255, > > when bandwidth hits maximum, then flat up to 255. > > If things stay that way, I'm arguing that the "Q" value enumerated in the ACPI > > table should be the value where peak bandwidth is hit > > I see. 
If I understand correctly, the BIOS needs to pre-train the system to > find this Q. However, if the BIOS cannot provide this Q, would it be > feasible > for the user to provide it? For example, the user could saturate the memory > bandwidth, gradually increase MB_MAX, and finally find the Q_max where the > memory bandwidth no longer increases. The user could then adjust the max > field in the info file. > > > (though this is complicated > > because workloads with different mixes of read/write access have different > > throttle graphs). > > > > Does this mean read and write operations have different Q values to saturate > the memory bandwidth? For example, if the workload is all reads, there > is a Q_r; > if the workload is all writes, there is another Q_w. In that case, maybe we > could choose the maximum of Q_r and Q_w (max(Q_r, Q_w)). If the BIOS doesn't provide a good enough number, then users might well do some tuning based on the workloads they plan to run and ignore the value in the info file in favor of one tuned specifically for their workloads. But it is too early to start guessing at workarounds for problems that may not even exist. -Tony ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 14:13 ` Dave Martin 2025-09-29 16:23 ` Luck, Tony @ 2025-09-30 4:43 ` Chen, Yu C 2025-09-30 15:55 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Chen, Yu C @ 2025-09-30 4:43 UTC (permalink / raw) To: Dave Martin, Luck, Tony Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc On 9/29/2025 10:13 PM, Dave Martin wrote: > Hi there, > > On Mon, Sep 29, 2025 at 05:19:32PM +0800, Chen, Yu C wrote: >> On 9/26/2025 6:58 AM, Luck, Tony wrote: > > [...] > >>> Applying this to Intel upcoming region aware memory bandwidth >>> that supports 255 steps and h/w min/max limits. >>> We would have info files with "min = 1, max = 255" and a schemata >>> file that looks like this to legacy apps: >>> >>> MB: 0=50;1=75 >>> #MB_HW: 0=128;1=191 >>> #MB_MIN: 0=128;1=191 >>> #MB_MAX: 0=128;1=191 >>> >>> But a newer app that is aware of the extensions can write: >>> >>> # cat > schemata << 'EOF' >>> MB_HW: 0=10 >>> MB_MIN: 0=10 >>> MB_MAX: 0=64 >>> EOF >>> >>> which then reads back as: >>> MB: 0=4;1=75 >>> #MB_HW: 0=10;1=191 >>> #MB_MIN: 0=10;1=191 >>> #MB_MAX: 0=64;1=191 >>> >>> with the legacy line updated with the rounded value of the MB_HW >>> supplied by the user. 10/255 = 3.921% ... so call it "4". >>> >> >> This seems to be applicable as it introduces the new interface >> while preserving forward compatibility. >> >> One minor question is that, according to "Figure 6-5. MBA Optimal >> Bandwidth Register" in the latest RDT specification, the maximum >> value ranges from 1 to 511. >> Additionally, this bandwidth field is located at bits 48 to 56 in >> the MBA Optimal Bandwidth Register, and the range for >> this segment could be 1 to 8191. Just wonder if it would be >> possible that the current maximum value of 512 may be extended >> in the future? 
Perhaps we could explore a method to query the maximum upper >> limit from the ACPI table or register, or use CPUID to distinguish between >> platforms rather than hardcoding it. Reinette also mentioned this in another >> thread. >> >> Thanks, >> Chenyu >> >> >> [1] https://www.intel.com/content/www/us/en/content-details/851356/intel-resource-director-technology-intel-rdt-architecture-specification.html > > I can't comment on the direction of travel in the RDT architecture. > > I guess it would be up to the arch code whether to trust ACPI if it > says that the maximum value of this field is > 511. (> 65535 would be > impossible though, since the fields would start to overlap each > other...) > > Would anything break in the interface proposed here, if the maximum > value is larger than 511? (I'm hoping not. For MPAM, the bandwidth > controls can have up to 16 bits and the size can be probed though MMIO > registers. > I overlooked this bit width. It should not exceed 511 according to the RDT spec. Previously, I was just wondering how to calculate the legacy MB percentage in Tony's example. If we want to keep consistency - if the user provides a value of 10, what is the denominator: Is it 255, 511, or something queried from ACPI. MB: 0=4;1=75 <--- 10/255 #MB_HW: 0=10;1=191 #MB_MIN: 0=10;1=191 #MB_MAX: 0=64;1=191 or MB: 0=1;1=75 <--- 10/511 #MB_HW: 0=10;1=191 #MB_MIN: 0=10;1=191 #MB_MAX: 0=64;1=191 thanks, Chenyu > I don't think we've seen MPAM hardware that comes close to 16 bits for > now, though. > > Cheers > ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-30 4:43 ` Chen, Yu C @ 2025-09-30 15:55 ` Dave Martin 2025-10-01 12:13 ` Chen, Yu C 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-09-30 15:55 UTC (permalink / raw) To: Chen, Yu C Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi, On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote: > On 9/29/2025 10:13 PM, Dave Martin wrote: [...] > > I guess it would be up to the arch code whether to trust ACPI if it > > says that the maximum value of this field is > 511. (> 65535 would be > > impossible though, since the fields would start to overlap each > > other...) > > > > Would anything break in the interface proposed here, if the maximum > > value is larger than 511? (I'm hoping not. For MPAM, the bandwidth > > controls can have up to 16 bits and the size can be probed though MMIO > > registers. > > > > I overlooked this bit width. It should not exceed 511 according to the > RDT spec. Previously, I was just wondering how to calculate the legacy > MB percentage in Tony's example. If we want to keep consistency - if > the user provides a value of 10, what is the denominator: Is it 255, > 511, or something queried from ACPI. > > MB: 0=4;1=75 <--- 10/255 > #MB_HW: 0=10;1=191 > #MB_MIN: 0=10;1=191 > #MB_MAX: 0=64;1=191 > > or > > MB: 0=1;1=75 <--- 10/511 > #MB_HW: 0=10;1=191 > #MB_MIN: 0=10;1=191 > #MB_MAX: 0=64;1=191 > > thanks, > Chenyu The denominator (the "scale" parameter in my model, though the name is unimportant) should be whatever quantity of resource is specified in the "unit" parameter. For "percentage" type controls, I'd expect the unit to be 100% ("100pc" in my syntax). So, Tony's suggestion looks plausible to me [1]: | Yes. 
So, if ACPI says Q=387, that's the denominator we advertise. Does that sound right? Question: is this a global parameter, or per-CPU? From the v1.2 RDT spec, it looks like it is a single, global parameter. I hope this is true (!) But I'm not too familiar with these specs... Cheers ---Dave [1] https://lore.kernel.org/lkml/aNq11fmlac6dH4pH@agluck-desk3/ > ^ permalink raw reply [flat|nested] 52+ messages in thread
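The denominator arithmetic discussed above can be sketched as follows. This is an illustrative model only; the function name and the round-to-nearest choice are assumptions, not kernel code (they match the thread's worked example, where 10/255 = 3.921% reads back as "4"):

```python
def hw_to_percent(hw, q):
    """Map a hardware bandwidth control value hw onto the legacy
    "MB" percentage, with Q (the ACPI-advertised value meaning
    "no throttling", i.e. 100%) as the denominator.
    Round-to-nearest integer arithmetic, matching the worked
    example in the thread (10/255 = 3.921% reads back as 4)."""
    return (hw * 100 + q // 2) // q

# Q = 255, values from Tony's example earlier in the thread:
print(hw_to_percent(10, 255))   # -> 4
print(hw_to_percent(191, 255))  # -> 75

# If ACPI advertised Q = 387 instead, the same hardware value
# maps to a different legacy percentage:
print(hw_to_percent(10, 387))   # -> 3
```

Whatever Q is advertised, the only invariant is that hw == Q maps to 100%.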
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-30 15:55 ` Dave Martin @ 2025-10-01 12:13 ` Chen, Yu C 2025-10-02 15:40 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Chen, Yu C @ 2025-10-01 12:13 UTC (permalink / raw) To: Dave Martin Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc On 9/30/2025 11:55 PM, Dave Martin wrote: > Hi, > > On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote: >> On 9/29/2025 10:13 PM, Dave Martin wrote: > > [...] > >>> I guess it would be up to the arch code whether to trust ACPI if it >>> says that the maximum value of this field is > 511. (> 65535 would be >>> impossible though, since the fields would start to overlap each >>> other...) >>> >>> Would anything break in the interface proposed here, if the maximum >>> value is larger than 511? (I'm hoping not. For MPAM, the bandwidth >>> controls can have up to 16 bits and the size can be probed though MMIO >>> registers. >>> >> >> I overlooked this bit width. It should not exceed 511 according to the >> RDT spec. Previously, I was just wondering how to calculate the legacy >> MB percentage in Tony's example. If we want to keep consistency - if >> the user provides a value of 10, what is the denominator: Is it 255, >> 511, or something queried from ACPI. >> >> MB: 0=4;1=75 <--- 10/255 >> #MB_HW: 0=10;1=191 >> #MB_MIN: 0=10;1=191 >> #MB_MAX: 0=64;1=191 >> >> or >> >> MB: 0=1;1=75 <--- 10/511 >> #MB_HW: 0=10;1=191 >> #MB_MIN: 0=10;1=191 >> #MB_MAX: 0=64;1=191 >> >> thanks, >> Chenyu > > The denomiator (the "scale" parameter in my model, though the name is > unimportant) should be whatever quantity of resource is specified in > the "unit" parameter. > > For "percentage" type controls, I'd expect the unit to be 100% ("100pc" > in my syntax). > > So, Tony suggestion looks plausible to me [1] : > > | Yes. 
255 (or whatever "Q" value is provided in the ACPI table) > | corresponds to no throttling, so 100% bandwidth. > > So, if ACPI says Q=387, that's the denominator we advertise. > > Does that sound right? > Yes, it makes sense, the denominator is the "scale" in your example. > Question: is this a global parameter, or per-CPU? > It should be a global setting for all the MBA Register Blocks. Thanks, Chenyu > From the v1.2 RDT spec, it looks like it is a single, global parameter. > I hope this is true (!) But I'm not too familiar with these specs... > > Cheers > ---Dave > > > [1] https://lore.kernel.org/lkml/aNq11fmlac6dH4pH@agluck-desk3/ >> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-01 12:13 ` Chen, Yu C @ 2025-10-02 15:40 ` Dave Martin 2025-10-02 16:43 ` Luck, Tony 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-10-02 15:40 UTC (permalink / raw) To: Chen, Yu C Cc: Luck, Tony, linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi there, On Wed, Oct 01, 2025 at 08:13:45PM +0800, Chen, Yu C wrote: > On 9/30/2025 11:55 PM, Dave Martin wrote: > > Hi, > > > > On Tue, Sep 30, 2025 at 12:43:36PM +0800, Chen, Yu C wrote: [...] > > > I overlooked this bit width. It should not exceed 511 according to the > > > RDT spec. Previously, I was just wondering how to calculate the legacy > > > MB percentage in Tony's example. If we want to keep consistency - if > > > the user provides a value of 10, what is the denominator: Is it 255, > > > 511, or something queried from ACPI. > > > > > > MB: 0=4;1=75 <--- 10/255 > > > #MB_HW: 0=10;1=191 > > > #MB_MIN: 0=10;1=191 > > > #MB_MAX: 0=64;1=191 > > > > > > or > > > > > > MB: 0=1;1=75 <--- 10/511 > > > #MB_HW: 0=10;1=191 > > > #MB_MIN: 0=10;1=191 > > > #MB_MAX: 0=64;1=191 > > > > > > thanks, > > > Chenyu > > > > The denomiator (the "scale" parameter in my model, though the name is > > unimportant) should be whatever quantity of resource is specified in > > the "unit" parameter. > > > > For "percentage" type controls, I'd expect the unit to be 100% ("100pc" > > in my syntax). > > > > So, Tony suggestion looks plausible to me [1] : > > > > | Yes. 255 (or whatever "Q" value is provided in the ACPI table) > > | corresponds to no throttling, so 100% bandwidth. > > > > So, if ACPI says Q=387, that's the denominator we advertise. > > > > Does that sound right? > > > > Yes, it makes sense, the denominator is the "scale" in your example. Thanks for confirming that. > > Question: is this a global parameter, or per-CPU? 
> > > > It should be a global setting for all the MBA Register Blocks. That's good -- since resctrl resource controls are not per-CPU, exposing the exact hardware resolution won't work unless the value is scaled identically for all CPUs. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-02 15:40 ` Dave Martin @ 2025-10-02 16:43 ` Luck, Tony 0 siblings, 0 replies; 52+ messages in thread From: Luck, Tony @ 2025-10-02 16:43 UTC (permalink / raw) To: Dave Martin, Chen, Yu C Cc: linux-kernel@vger.kernel.org, Chatre, Reinette, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86@kernel.org, linux-doc@vger.kernel.org > > > So, if ACPI says Q=387, that's the denominator we advertise. > > > > > > Does that sound right? > > > > > > > Yes, it makes sense, the denominator is the "scale" in your example. > > Thanks for confirming that. > > > > Question: is this a global parameter, or per-CPU? > > > > > > > It should be a global setting for all the MBA Register Blocks. > > That's good -- since resctrl resource controls are not per-CPU, > exposing the exact hardware resolution won't work unless the value > is scaled identically for all CPUs. The RDT architecture spec says there is a separate MARC table that describes each instance on an L3 cache. So in theory there could be different "Q" values for each. I'm chatting with the architects to point out that would be bad, and they shouldn't build something that has different "Q" values on the same system. -Tony ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-25 22:58 ` Luck, Tony 2025-09-29 9:19 ` Chen, Yu C @ 2025-09-29 13:56 ` Dave Martin 2025-09-29 16:09 ` Reinette Chatre 2025-09-29 16:37 ` Luck, Tony 1 sibling, 2 replies; 52+ messages in thread From: Dave Martin @ 2025-09-29 13:56 UTC (permalink / raw) To: Luck, Tony Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Tony, Thanks for taking a look at this -- comments below. [...] On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: [...] > > Would something like the following work? A read from schemata might > > produce something like this: > > > > MB: 0=50, 1=50 > > # MB_HW: 0=32, 1=32 > > # MB_MIN: 0=31, 1=31 > > # MB_MAX: 0=32, 1=32 [...] > > I'd be interested in people's thoughts on it, though. > > Applying this to Intel upcoming region aware memory bandwidth > that supports 255 steps and h/w min/max limits. Following the MPAM example, would you also expect: scale: 255 unit: 100pc ...? > We would have info files with "min = 1, max = 255" and a schemata > file that looks like this to legacy apps: > > MB: 0=50;1=75 > #MB_HW: 0=128;1=191 > #MB_MIN: 0=128;1=191 > #MB_MAX: 0=128;1=191 > > But a newer app that is aware of the extensions can write: > > # cat > schemata << 'EOF' > MB_HW: 0=10 > MB_MIN: 0=10 > MB_MAX: 0=64 > EOF > > which then reads back as: > MB: 0=4;1=75 > #MB_HW: 0=10;1=191 > #MB_MIN: 0=10;1=191 > #MB_MAX: 0=64;1=191 > > with the legacy line updated with the rounded value of the MB_HW > supplied by the user. 10/255 = 3.921% ... so call it "4". I'm suggesting that this always be rounded up, so that you have a guarantee that the steps are no smaller than the reported value. (In this case, round-up and round-to-nearest give the same answer anyway, though!) 
> > The region aware h/w supports separate bandwidth controls for each > region. We could hope (or perhaps update the spec to define) that > region0 is always node-local DDR memory and keep the "MB" tag for > that. Do you have concerns about existing software choking on the #-prefixed lines? > Then use some other tag naming for other regions. Remote DDR, > local CXL, remote CXL are the ones we think are next in the h/w > memory sequence. But the "region" concept would allow for other > options as other memory technologies come into use. Would it be reasonable just to have a set of these schema instances, per region, so: MB_HW: ... // implicitly region 0 MB_HW_1: ... MB_HW_2: ... etc. Or, did you have something else in mind? My thinking is that we avoid adding complexity in the schemata file if we treat mapping these schema instances onto the hardware topology as an orthogonal problem. So long as we have unique names in the schemata file, we can describe elsewhere what they relate to in the hardware. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
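The round-up suggestion in this message can be sketched the same way as the read-back conversion. Illustrative only; the helper name is invented here, and only the rounding direction comes from the discussion (Tony's worked numbers used round-to-nearest):

```python
def hw_to_percent_round_up(hw, q):
    """Read-back conversion that always rounds up, per the
    suggestion above.  Integer ceiling of hw * 100 / q, where q
    is the hardware value meaning "no throttling" (100%).
    Illustrative only; not kernel code."""
    return (hw * 100 + q - 1) // q

# 10/255 = 3.921%: round-up and round-to-nearest agree on 4
print(hw_to_percent_round_up(10, 255))   # -> 4
# 128/255 = 50.196%: round-up gives 51 where nearest gives 50
print(hw_to_percent_round_up(128, 255))  # -> 51
```

The two rounding modes only diverge when the true fraction falls below the halfway point, as the 128/255 case shows.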
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 13:56 ` Dave Martin @ 2025-09-29 16:09 ` Reinette Chatre 2025-09-30 15:40 ` Dave Martin 2025-09-29 16:37 ` Luck, Tony 1 sibling, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-09-29 16:09 UTC (permalink / raw) To: Dave Martin, Luck, Tony Cc: linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 9/29/25 6:56 AM, Dave Martin wrote: > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: >> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: ... >> The region aware h/w supports separate bandwidth controls for each >> region. We could hope (or perhaps update the spec to define) that >> region0 is always node-local DDR memory and keep the "MB" tag for >> that. > > Do you have concerns about existing software choking on the #-prefixed > lines? I am trying to understand the purpose of the #-prefix. I see two motivations for the #-prefix with the primary point that multiple schema apply to the same resource. 1) Commented schema are "inactive" This is unclear to me. In the MB example the commented lines show the finer grained controls. Since the original MB resource is an approximation and the hardware must already be configured to support it, would the #-prefixed lines not show the actual "active" configuration? 2) Commented schema are "conflicting" The original proposal mentioned "write them back instead of (or in addition to) the conflicting entries". I do not know how resctrl will be able to handle a user requesting a change to both "MB" and "MB_HW". This seems like something that should fail? On a high level it is not clear to me why the # prefix is needed. As I understand the schemata names will always be unique and the new features made backward compatible to existing schemata names. That is, existing MB, L3, etc. 
will also have the new info files that describe their values/ranges. I expect that user space will ignore schema that it is not familiar with so the # prefix seems unnecessary? I believe the motivation is to express a relationship between different schema (you mentioned "shadow" initially). I think this relationship can be expressed clearly by using a namespace prefix (like "MB_" in the examples). This may help more when there are multiple schemata with this format where a #-prefix does not make obvious which resource is shadowed. >> Then use some other tag naming for other regions. Remote DDR, >> local CXL, remote CXL are the ones we think are next in the h/w >> memory sequence. But the "region" concept would allow for other >> options as other memory technologies come into use. > > Would it be reasnable just to have a set of these schema instances, per > region, so: > > MB_HW: ... // implicitly region 0 > MB_HW_1: ... > MB_HW_2: ... > > etc. > > Or, did you have something else in mind? > > My thinking is that we avoid adding complexity in the schemata file if > we treat mapping these schema instances onto the hardware topology as > an orthogonal problem. So long as we have unique names in the schemata > file, we can describe elsewhere what they relate to in the hardware. Agreed ... and "elsewhere" is expected to be unique depending on the resource. Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 16:09 ` Reinette Chatre @ 2025-09-30 15:40 ` Dave Martin 2025-10-10 16:48 ` Reinette Chatre 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-09-30 15:40 UTC (permalink / raw) To: Reinette Chatre Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi, On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/29/25 6:56 AM, Dave Martin wrote: > > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: > >> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: > > ... > > >> The region aware h/w supports separate bandwidth controls for each > >> region. We could hope (or perhaps update the spec to define) that > >> region0 is always node-local DDR memory and keep the "MB" tag for > >> that. > > > > Do you have concerns about existing software choking on the #-prefixed > > lines? > > I am trying to understand the purpose of the #-prefix. I see two motivations > for the #-prefix with the primary point that multiple schema apply to the same > resource. > > 1) Commented schema are "inactive" > This is unclear to me. In the MB example the commented lines show the > finer grained controls. Since the original MB resource is an approximation > and the hardware must already be configured to support it, would the #-prefixed > lines not show the actual "active" configuration? They would show the active configuration (possibly more precisely than "MB" does). If not, it's not clear how userspace that is trying to use MB_HW (say) could read out the current configuration. The # is intended to make resctrl ignore the lines when the file is written by userspace. This is done so that userspace has to actually change those lines in order for them to take effect when writing. 
Old userspace can just pass them through without modification, without anything unexpected happening. The reason why I think that this convention may be needed is that we never told (old) userspace what it was supposed to do with schemata entries that it does not recognise. > 2) Commented schema are "conflicting" > The original proposal mentioned "write them back instead of (or in addition to) > the conflicting entries". I do not know how resctrl will be able to > handle a user requesting a change to both "MB" and "MB_HW". This seems like > something that should fail? If userspace is asking for two incompatible things at the same time, we can either pick one of them and ignore the rest, or do nothing, or fail explicitly. If we think that it doesn't really matter what happens, then resctrl could just dumbly process the entries in the order given. If the result is not what userspace wanted, that's not our problem. (Today, nothing prevents userspace writing multiple "MB" lines at the same time: resctrl will process them all, but only the final one will have a lasting effect. So, the fact that a resctrl write can contain mutually incompatible requests does not seem to be new.) > On a high level it is not clear to me why the # prefix is needed. As I understand the > schemata names will always be unique and the new features made backward > compatible to existing schemata names. That is, existing MB, L3, etc. > will also have the new info files that describe their values/ranges. Regarding backwards compatibility for the existing controls: This proposal is only about numeric controls. L3 wouldn't change, but we could still add info/ metadata for bitmap control at the same time as adding it for numeric controls. MB may be hard to describe in a useful way, though -- at least in the MPAM case, where the number of steps does not divide into 100, and the AMD cases where the meaning of the MB control values is different. MB and MB_HW are not interchangeable. 
To obtain predictable results from MB, userspace would need to know precisely how the kernel is going to round the value. This feels like an implementation detail that doesn't belong in the ABI. I suppose we could also add a "granularity" entry in info/, but we have the existing "bandwidth_gran" file for MB. For any new schema, I don't think we need to state the granularity: the other parameters can always be adjusted so that the granularity is exactly 1. Regarding the "#" (see below): > I expect that user space will ignore schema that it is not familiar > with so the # prefix seems unnecessary? > > I believe the motivation is to express a relationship between different > schema (you mentioned "shadow" initially). I think this relationship can > be expressed clearly by using a namespace prefix (like "MB_" in the examples). > This may help more when there are multiple schemata with this format where a #-prefix > does not make obvious which resource is shadowed. An illustration would probably help, here. Say resctrl has schemata MB, MB_HW, MB_MIN and MB_MAX, all of which control (aspects of) the same underlying hardware resource. Reading the schemata file might yield: MB: 0=29 MB_HW: 0=2 MB_MIN: 0=1 MB_MAX: 0=2 (I assume for this toy example that MB_{HW,MIN,MAX} ranges from 0 to 7.) Now, suppose (current) userspace wants to change the allocated bandwidth. It only understands the "MB" line, but it is reasonable to expect that writing the other lines back without modification will do nothing. (A human user might read the file, and tweak it through an editor to modify just the entry of interest, run it through awk, etc.) So, the user writes back: MB: 0=43 MB_HW: 0=2 MB_MIN: 0=1 MB_MAX: 0=2 If resctrl just processes the entries in order, it will temporarily change the bandwidth allocation due to the "MB" row, but will then immediately change it back again due to the other rows. 
Reading schemata again now gives: MB: 0=29 MB_HW: 0=2 MB_MIN: 0=1 MB_MAX: 0=2 We might be able to solve some problems of this sort by reordering the entries, but I suspect that some software may give up as soon as it sees an unfamiliar entry -- so it may be better to keep the classic entries (like "MB") at the start. Anyway, going back to the "#" convention: If the initial read of schemata has the new entries "pre-commented", then userspace wouldn't need to know about the new entries. It could just tweak the MB entry (which it knows about), and write the file back: MB: 0=43 # MB_HW: 0=2 # MB_MIN: 0=1 # MB_MAX: 0=2 then resctrl knows to ignore the hashed lines, and so reading the file back gives: MB: 0=43 # MB_HW: 0=3 # MB_MIN: 0=2 # MB_MAX: 0=3 (For hardware-specific reasons, the MPAM driver currently internally programs the MIN bound to be a bit less than the MAX bound, when userspace writes an "MB" entry into schemata. The key thing is that writing MB may cause the MB_MIN/MB_MAX entries to change -- at the resctrl level, I don't think that we necessarily need to make promises about what they can change _to_. The exact effect of MIN and MAX bounds is likely to be hardware-dependent anyway.) Regarding new userspace: Going forward, we can explicitly document that there should be no conflicting or "passenger" entries in a schemata write: don't include an entry for something that you don't explicitly want to set, and if multiple entries affect the same resource, we don't promise what happens. (But sadly, we can't impose that rule on existing software after the fact.) One final note: I have not provided any way to indicate that all those entries control the same hardware resource. The common "MB" prefix is intended as a clue, but ultimately, userspace needs to know what an entry controls before tweaking it. We could try to describe the relationships explicitly, but I'm not sure that it is useful... > >> Then use some other tag naming for other regions. 
Remote DDR, > >> local CXL, remote CXL are the ones we think are next in the h/w > >> memory sequence. But the "region" concept would allow for other > >> options as other memory technologies come into use. > > > > Would it be reasnable just to have a set of these schema instances, per > > region, so: > > > > MB_HW: ... // implicitly region 0 > > MB_HW_1: ... > > MB_HW_2: ... > > > > etc. > > > > Or, did you have something else in mind? > > > > My thinking is that we avoid adding complexity in the schemata file if > > we treat mapping these schema instances onto the hardware topology as > > an orthogonal problem. So long as we have unique names in the schemata > > file, we can describe elsewhere what they relate to in the hardware. > > Agreed ... and "elsewhere" is expected to be unique depending on the resource. > > Reinette Yes Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
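The "#" convention described in this message can be modelled in a few lines of toy parser. This is a sketch of the proposed write-side rule only, not the kernel's actual schemata parser (which lives in fs/resctrl/ctrlmondata.c and has no such convention today); the function name and error handling are invented:

```python
def parse_schemata_write(buf):
    """Toy write-side parser for the proposed convention: lines
    whose first non-blank character is '#' are ignored, so
    pre-commented entries written back unmodified by old
    userspace have no effect.  Hypothetical, not kernel code."""
    requests = []
    for line in buf.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # commented-out entry: not a request
        name, sep, value = line.partition(':')
        if not sep:
            raise ValueError('missing ":" in %r' % line)
        requests.append((name.strip(), value.strip()))
    return requests

# Old userspace writes the whole file back with only "MB" changed;
# only the "MB" line is treated as a request:
buf = """MB: 0=43
# MB_HW: 0=2
# MB_MIN: 0=1
# MB_MAX: 0=2
"""
print(parse_schemata_write(buf))  # -> [('MB', '0=43')]
```

New userspace that wants the fine-grained controls simply writes the lines without the leading "#", and they parse as ordinary requests.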
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-30 15:40 ` Dave Martin @ 2025-10-10 16:48 ` Reinette Chatre 2025-10-11 17:15 ` Chen, Yu C 2025-10-13 14:36 ` Dave Martin 0 siblings, 2 replies; 52+ messages in thread From: Reinette Chatre @ 2025-10-10 16:48 UTC (permalink / raw) To: Dave Martin Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 9/30/25 8:40 AM, Dave Martin wrote: > On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: >> On 9/29/25 6:56 AM, Dave Martin wrote: >>> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: >>>> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: >> >> ... >> >>>> The region aware h/w supports separate bandwidth controls for each >>>> region. We could hope (or perhaps update the spec to define) that >>>> region0 is always node-local DDR memory and keep the "MB" tag for >>>> that. >>> >>> Do you have concerns about existing software choking on the #-prefixed >>> lines? >> >> I am trying to understand the purpose of the #-prefix. I see two motivations >> for the #-prefix with the primary point that multiple schema apply to the same >> resource. >> >> 1) Commented schema are "inactive" >> This is unclear to me. In the MB example the commented lines show the >> finer grained controls. Since the original MB resource is an approximation >> and the hardware must already be configured to support it, would the #-prefixed >> lines not show the actual "active" configuration? > > They would show the active configuration (possibly more precisely than > "MB" does). That is how I see it also. This is specific to MB as we try to maintain backward compatibility. If we are going to make user interface changes to resource allocation then ideally it should consider all known future usage. 
I am trying to navigate and understand the discussion on how resctrl can support MPAM and this RDT region aware requirements. I scanned the MPAM spec and from what I understand a resource may support multiple controls at the same time, each with its own properties, and then there was this: When multiple partitioning controls are active, each affects the partition’s bandwidth usage. However, some combinations of controls may not make sense, because the regulation of that pair of controls cannot be made to work in concert. resctrl may thus present an "active configuration" that is not a configuration that "makes sense" ... this may be ok as resctrl would present what hardware supports combined with what user requested. > If not, it's not clear how userspace that is trying to use MB_HW (say) > could read out the current configuration. > > The # is intended to make resctrl ignore the lines when the file > is written by userspace. This is done so that userspace has to > actually change those lines in order for them to take effect when > writing. Old userspace can just pass them through without modification, > without anything unexpected happening. Thank you for highlighting this. I did not consider this use case. > > The reason why I think that this convention may be needed is that we > never told (old) userspace what it was supposed to do with schemata > entries that it does not recognise. > > >> 2) Commented schema are "conflicting" >> The original proposal mentioned "write them back instead of (or in addition to) >> the conflicting entries". I do not know how resctrl will be able to >> handle a user requesting a change to both "MB" and "MB_HW". This seems like >> something that should fail? > > If userspace is asking for two incompatible things at the same time, we > can either pick one of them and ignore the rest, or do nothing, or fail > explicitly. 
> > If we think that it doesn't really matter what happens, then resctrl > could just dumbly process the entries in the order given. If the > result is not what userspace wanted, that's not our problem. > > (Today, nothing prevents userspace writing multiple "MB" lines at the > same time: resctrl will process them all, but only the final one will > have a lasting effect. So, the fact that a resctrl write can contain > mutually incompatible requests does not seem to be new.) Good point. > > >> On a high level it is not clear to me why the # prefix is needed. As I understand the >> schemata names will always be unique and the new features made backward >> compatible to existing schemata names. That is, existing MB, L3, etc. >> will also have the new info files that describe their values/ranges. > > Regarding backwards compatibility for the existing controls: > > This proposal is only about numeric controls. L3 wouldn't change, but > we could still add info/ metadata for bitmap control at the same time > as adding it for numeric controls. I think we should. At least we should leave space for such an addition since it is not obvious to me how multiple resources with different controls or single resource with multiple controls should be communicated to user space. To be specific, the original proposal [1] introduced a set of files for a numeric control and that seems to work for existing and upcoming schema that need a value in a range. Different controls need different parameters so to integrate this solution I think it needs another parameter (presented as a directory, a file, or within a file) that indicates the type of the control so that user space knows which files/parameters to expect and how to interpret them. Since different controls have different parameters we need to consider whether it is easier to create/parse unique files for each control or present all the parameters within one file with another file noting the type of control. 
I understand the files/parameters are intended to be in the schema's info directory but how this will look is not obvious to me. Part of the MPAM refactoring transitioned the top level info directories to represent the schema entries that currently reflect the resources. When we start having multiple schema entries (multiple controls) for a single resource the simplest implementation may result in a top level info directory for every schema entry ... but the expectation is that these top level directories should be per resource, no? At this time I am envisioning the proposal to result in something like below where there is one resource directory and one directory per schema entry with a (added by me) "schema_type" file to help user find out what the schema type is to know which files are present: MB ├── bandwidth_gran ├── delay_linear ├── MB │ ├── map │ ├── max │ ├── min │ ├── scale │ ├── schema_type │ └── unit ├── MB_HW │ ├── map │ ├── max │ ├── min │ ├── scale │ ├── schema_type │ └── unit ├── MB_MAX │ └── tbd ├── MB_MIN │ └── tbd ├── min_bandwidth ├── num_closids └── thread_throttle_mode Something else related to control that caught my eye in MPAM spec is this gem: MPAM provides discoverable vendor extensions to permit partners to invent partitioning controls. > MB may be hard to describe in a useful way, though -- at least in the > MPAM case, where the number of steps does not divide into 100, and the > AMD cases where the meaning of the MB control values is different. Above I do assume that MB would be represented in a new interface since it is a schema entry, if that causes trouble then we could drop it. > > MB and MB_HW are not interchangeable. To obtain predictable results > from MB, userspace would need to know precisely how the kernel is going > to round the value. This feels like an implementation detail that > doesn't belong in the ABI. ack ... 
> Anyway, going back to the "#" convention: > > If the initial read of schemata has the new entries "pre-commented", > then userspace wouldn't need to know about the new entries. It could > just tweak the MB entry (which it knows about), and write the file back: > > MB: 0=43 > # MB_HW: 0=2 > # MB_MIN: 0=1 > # MB_MAX: 0=2 > > then resctrl knows to ignore the hashed lines, and so reading the file > back gives: > > MB: 0=43 > # MB_HW: 0=3 > # MB_MIN: 0=2 > # MB_MAX: 0=3 Thank you for the example. This seems reasonable. I would like to go back to what you wrote in [1]: > Software that understands the new entries can uncomment the conflicting > entries and write them back instead of (or in addition to) the > conflicting entries. For example, userspace might write the following: > > MB_MIN: 0=16, 1=16 > MB_MAX: 0=32, 1=32 > > Which might then read back as follows: > > MB: 0=50, 1=50 > # MB_HW: 0=32, 1=32 > # MB_MIN: 0=16, 1=16 > # MB_MAX: 0=32, 1=32 Could/should resctrl uncomment the lines after userspace modified them? > > (For hardware-specific reasons, the MPAM driver currently internally > programs the MIN bound to be a bit less than the MAX bound, when > userspace writes an "MB" entry into schemata. The key thing is that > writing MB may cause the MB_MIN/MB_MAX entries to change -- at the > resctrl level, I don't think that we necessarily need to make promises > about what they can change _to_. The exact effect of MIN and MAX > bounds is likely to be hardware-dependent anyway.) MPAM has the "HARDLIM" distinction associated with these MAX values and from what I can tell this is per PARTID. Is this something that needs to be supported? To do this resctrl will need to support modifying control properties per resource group. 
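A minimal simulation of the quoted convention, using a local file in place of the real schemata file; `grep -v` stands in for the line-skipping the proposal assumes in the kernel's write parser, and the file name and values are illustrative:

```shell
# Build a local stand-in for the schemata file as it would read back
# under the proposal, then drop the '#'-prefixed lines the same way
# the kernel would ignore them on a write.
printf 'MB: 0=43\n# MB_HW: 0=2\n# MB_MIN: 0=1\n# MB_MAX: 0=2\n' > schemata.txt
grep -v '^#' schemata.txt   # only the uncommented MB entry would take effect
```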
> > > Regarding new userspace: > > Going forward, we can explicitly document that there should be no > conflicting or "passenger" entries in a schemata write: don't include > an entry for something that you don't explicitly want to set, and if > multiple entries affect the same resource, we don't promise what > happens. > > (But sadly, we can't impose that rule on existing software after the > fact.) It may thus not be worth it to make such a rule. > > > One final note: I have not provided any way to indicate that all those > entries control the same hardware resource. The common "MB" prefix is > intended as a clue, but ultimately, userspace needs to know what an > entry controls before tweaking it. > > We could try to describe the relationships explicitly, but I'm not sure > that it is useful... What other relationships should we consider for MPAM? I see that each MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual, ... ? Is it expected that MPAM's support of these should be exposed via resctrl? Have you considered how to express if the user wants hardware to have different allocations for, for example, same PARTID at different execution levels? Reinette [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-10 16:48 ` Reinette Chatre @ 2025-10-11 17:15 ` Chen, Yu C 2025-10-13 15:01 ` Dave Martin 2025-10-13 14:36 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Chen, Yu C @ 2025-10-11 17:15 UTC (permalink / raw) To: Reinette Chatre, Dave Martin Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc On 10/11/2025 12:48 AM, Reinette Chatre wrote: > Hi Dave, > > On 9/30/25 8:40 AM, Dave Martin wrote: >> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: >>> On 9/29/25 6:56 AM, Dave Martin wrote: >>>> On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: >>>>> On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: [snip] >> Anyway, going back to the "#" convention: >> >> If the initial read of schemata has the new entries "pre-commented", >> then userspace wouldn't need to know about the new entries. It could >> just tweak the MB entry (which it knows about), and write the file back: >> >> MB: 0=43 >> # MB_HW: 0=2 >> # MB_MIN: 0=1 >> # MB_MAX: 0=2 >> >> then resctrl knows to ignore the hashed lines, and so reading the file >> back gives: >> >> MB: 0=43 >> # MB_HW: 0=3 >> # MB_MIN: 0=2 >> # MB_MAX: 0=3 May I ask if introducing the pre-commented lines is intended to prevent control conflicts over the same MBA? If this is the case, I wonder if, instead of exposing both MB and the pre-commented MB_HW_X in one file, it would be feasible to introduce a new mount option (such as "hw") to make the legacy MB and MB_HW_X mutually exclusive. If the user specifies "hw" in mount option, display MB_HW_X (if available); otherwise, display only the legacy "MB". This is similar to the cpufreq governor, where only one governor is allowed to adjust the CPU frequency at a time. thanks, Chenyu ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-11 17:15 ` Chen, Yu C @ 2025-10-13 15:01 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-10-13 15:01 UTC (permalink / raw) To: Chen, Yu C Cc: Reinette Chatre, Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi, On Sun, Oct 12, 2025 at 01:15:07AM +0800, Chen, Yu C wrote: > On 10/11/2025 12:48 AM, Reinette Chatre wrote: > > Hi Dave, > > > > On 9/30/25 8:40 AM, Dave Martin wrote: > > > On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: > > > > On 9/29/25 6:56 AM, Dave Martin wrote: > > > > > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: > > > > > > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: > > [snip] > > > > Anyway, going back to the "#" convention: > > > > > > If the initial read of schemata has the new entries "pre-commented", > > > then userspace wouldn't need to know about the new entries. It could > > > just tweak the MB entry (which it knows about), and write the file back: > > > > > > MB: 0=43 > > > # MB_HW: 0=2 > > > # MB_MIN: 0=1 > > > # MB_MAX: 0=2 > > > > > > then resctrl knows to ignore the hashed lines, and so reading the file > > > back gives: > > > > > > MB: 0=43 > > > # MB_HW: 0=3 > > > # MB_MIN: 0=2 > > > # MB_MAX: 0=3 > > May I ask if introducing the pre-commented lines is intended to prevent > control conflicts over the same MBA? If this is the case, I wonder if, Basically, yes. Note, this is only an issue for old software that doesn't understand the new entries. New software should omit the entries for resources that it doesn't understand or doesn't want to set. > instead of exposing both MB and the pre-commented MB_HW_X in one file, > it would be feasible to introduce a new mount option (such as "hw") to > make the legacy MB and MB_HW_X mutually exclusive. 
If the user specifies > "hw" in mount option, display MB_HW_X (if available); otherwise, display > only the legacy "MB". This is similar to the cpufreq governor, where only > one governor is allowed to adjust the CPU frequency at a time. > > thanks, > Chenyu This could be done with a mount option, but I am concerned that a single resctrl mount might be used by different tools while it is mounted. So, maybe a global control is not what we want? I'm hoping that we can design this in a way that is sufficiently backwards compatible that we don't need a mount option to turn it off. Can you think of a userspace scenario that would break, with the proposed design? Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-10 16:48 ` Reinette Chatre 2025-10-11 17:15 ` Chen, Yu C @ 2025-10-13 14:36 ` Dave Martin 2025-10-14 22:55 ` Reinette Chatre 1 sibling, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-10-13 14:36 UTC (permalink / raw) To: Reinette Chatre Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 9/30/25 8:40 AM, Dave Martin wrote: > > On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: > >> On 9/29/25 6:56 AM, Dave Martin wrote: [...] > >> 1) Commented schema are "inactive" > >> This is unclear to me. In the MB example the commented lines show the > >> finer grained controls. Since the original MB resource is an approximation > >> and the hardware must already be configured to support it, would the #-prefixed > >> lines not show the actual "active" configuration? > > > > They would show the active configuration (possibly more precisely than > > "MB" does). > > That is how I see it also. This is specific to MB as we try to maintain > backward compatibility. > > If we are going to make user interface changes to resource allocation then > ideally it should consider all known future usage. I am trying to navigate > and understand the discussion on how resctrl can support MPAM and this > RDT region aware requirements. > > I scanned the MPAM spec and from what I understand a resource may support > multiple controls at the same time, each with its own properties, and then > there was this: > > When multiple partitioning controls are active, each affects the partition’s > bandwidth usage. However, some combinations of controls may not make sense, > because the regulation of that pair of controls cannot be made to work in concert. 
> > resctrl may thus present an "active configuration" that is not a configuration > that "makes sense" ... this may be ok as resctrl would present what hardware > supports combined with what user requested. This is analogous to what the MPAM spec says, though if resctrl offers two different schemata for the same hardware control, the control cannot be configured with both values simultaneously. For the MPAM hardware controls affecting the same hardware resource, they can be programmed to combinations of values that have no sensible interpretation, and the values can be read back just fine. The performance effects may not be what the user expected / wanted, but this is not directly visible to resctrl. So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that will read back just as programmed. The architecture does not promise what the performance effect of this will be, but resctrl does not need to care. [...] > > The # is intended to make resctrl ignore the lines when the file > > is written by userspace. This is done so that userspace has to > > actually change those lines in order for them to take effect when > > writing. Old userspace can just pass them through without modification, > > without anything unexpected happening. > > Thank you for highlighting this. I did not consider this use case. [...] > >> 2) Commented schema are "conflicting" > >> The original proposal mentioned "write them back instead of (or in addition to) > >> the conflicting entries". I do not know how resctrl will be able to > >> handle a user requesting a change to both "MB" and "MB_HW". This seems like > >> something that should fail? > > > > If userspace is asking for two incompatible things at the same time, we > > can either pick one of them and ignore the rest, or do nothing, or fail > > explicitly. 
> > > > If we think that it doesn't really matter what happens, then resctrl > > could just dumbly process the entries in the order given. If the > > result is not what userspace wanted, that's not our problem. > > > > (Today, nothing prevents userspace writing multiple "MB" lines at the > > same time: resctrl will process them all, but only the final one will > > have a lasting effect. So, the fact that a resctrl write can contain > > mutually incompatible requests does not seem to be new.) > > Good point. [...] > > This proposal is only about numeric controls. L3 wouldn't change, but > > we could still add info/ metadata for bitmap control at the same time > > as adding it for numeric controls. > > I think we should. At least we should leave space for such an addition since > it is not obvious to me how multiple resources with different controls or > single resource with multiple controls should be communicated to user space. > > To be specific, the original proposal [1] introduced a set of files for > a numeric control and that seems to work for existing and upcoming > schema that need a value in a range. Different controls need different > parameters so to integrate this solution I think it needs another parameter > (presented as a directory, a file, or within a file) that indicates the > type of the control so that user space knows which files/parameters to expect > and how to interpret them. Agreed. I wasn't meaning to imply that this proposal shouldn't be integrated into something more general. If we want a richer description than the current one, it makes sense to incorporate bitmap controls -- this just wasn't my focus. > Since different controls have different parameters we need to consider > whether it is easier to create/parse unique files for each control or > present all the parameters within one file with another file noting the type > of control. 
Separate files work quite well for low-tech tooling built using shell scripts, and this seems to follow the sysfs philosophy. Since there is no need to keep re-reading these parameters, simplicity feels more important than efficiency? But we could equally have a single file with multiple pieces of information in it. I don't have a strong view on this. > I understand the files/parameters are intended to be in the schema's info directory > but how this will look is not obvious to me. Part of the MPAM refactoring transitioned > the top level info directories to represent the schema entries that currently reflect > the resources. When we start having multiple schema entries (multiple controls) for a > single resource the simplest implementation may result in a top level info > directory for every schema entry ... but the expectation is that these top > level directories should be per resource, no? I had not considered that the info/ directories correspond to resources, not individual schemata... > > At this time I am envisioning the proposal to result in something like below where > there is one resource directory and one directory per schema entry with a (added by me) > "schema_type" file to help user find out what the schema type is to know which files are present: > > MB > ├── bandwidth_gran > ├── delay_linear > ├── MB > │ ├── map > │ ├── max > │ ├── min > │ ├── scale > │ ├── schema_type > │ └── unit > ├── MB_HW > │ ├── map > │ ├── max > │ ├── min > │ ├── scale > │ ├── schema_type > │ └── unit > ├── MB_MAX > │ └── tbd > ├── MB_MIN > │ └── tbd > ├── min_bandwidth > ├── num_closids > └── thread_throttle_mode I see no reason not to do that. Either way, older userspace just ignores the new files and directories. Perhaps add an intermediate subdirectory to clarify the relationship between the resource dir and the individual schema descriptions? This may also avoid the new descriptions getting mixed up with the old description files. 
Say, info ├── MB │ ├── resource_schemata │ │ ├── MB │ │ │ ├── map │ │ │ ├── max │ ┆ │ ├── min │ │ ┆ ┆ │ ├── MB_HW │ ├── map │ ┆ ┆ > Something else related to control that caught my eye in MPAM spec is this gem: > MPAM provides discoverable vendor extensions to permit partners > to invent partitioning controls. Yup. Since we have no way to know what vendor-specific controls look like or what they mean, we can't do much about this. So, it's the vendor's job to implement support for it, and we might still say no (if there is no sane way to integrate it). > > MB may be hard to describe in a useful way, though -- at least in the > MPAM case, where the number of steps does not divide into 100, and the > AMD cases where the meaning of the MB control values is different. > > Above I do assume that MB would be represented in a new interface since it > is a schema entry, if that causes trouble then we could drop it. Since MB is described by the existing files and the documentation, perhaps it doesn't need an additional description. Alternatively though, could we just have a special schema_type for this, and omit the other properties? This would mean that we at least have an entry for every schema. > > > > > MB and MB_HW are not interchangeable. To obtain predictable results > > from MB, userspace would need to know precisely how the kernel is going > > to round the value. This feels like an implementation detail that > > doesn't belong in the ABI. > > ack > > ... > > > Anyway, going back to the "#" convention: > > > > If the initial read of schemata has the new entries "pre-commented", > > then userspace wouldn't need to know about the new entries. 
It could > > just tweak the MB entry (which it knows about), and write the file back: > > > > MB: 0=43 > > # MB_HW: 0=2 > > # MB_MIN: 0=1 > > # MB_MAX: 0=2 > > > > then resctrl knows to ignore the hashed lines, and so reading the file > > back gives: > > > > MB: 0=43 > > # MB_HW: 0=3 > > # MB_MIN: 0=2 > > # MB_MAX: 0=3 > > Thank you for the example. This seems reasonable. I would like to go back > to what you wrote in [1]: > > > Software that understands the new entries can uncomment the conflicting > > entries and write them back instead of (or in addition to) the > > conflicting entries. For example, userspace might write the following: > > > > MB_MIN: 0=16, 1=16 > > MB_MAX: 0=32, 1=32 > > > > Which might then read back as follows: > > > > MB: 0=50, 1=50 > > # MB_HW: 0=32, 1=32 > > # MB_MIN: 0=16, 1=16 > > # MB_MAX: 0=32, 1=32 > > Could/should resctrl uncomment the lines after userspace modified them? The '#' wasn't meant to be a state that gets turned on and off. Rather, userspace would use this to indicate which entries are intentionally being modified. So long as the entries affecting a single resource are ordered so that each entry is strictly more specific than the previous entries (as illustrated above), then reading schemata and stripping all the hashes would allow a previous configuration to be restored; to change just one entry, userspace can uncomment just that one, or write only that entry (which is what I think we should recommend for new software). > > (For hardware-specific reasons, the MPAM driver currently internally > > programs the MIN bound to be a bit less than the MAX bound, when > > userspace writes an "MB" entry into schemata. The key thing is that > > writing MB may cause the MB_MIN/MB_MAX entries to change -- at the > > resctrl level, I don't that that we necessarily need to make promises > > about what they can change _to_. The exact effect of MIN and MAX > > bounds is likely to be hardware-dependent anyway.) 
> > MPAM has the "HARDLIM" distinction associated with these MAX values > and from what I can tell this is per PARTID. Is this something that needs > to be supported? To do this resctrl will need to support modifying > control properties per resource group. Possibly. Since this is a boolean control that determines how the MBW_MAX control is applied, we could perhaps present it as an additional schema -- if so, it's basically orthogonal. | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...] or | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...] Does this look reasonable? I don't know whether we have a clear use case for this today, and like almost everything else in MPAM, implementing it is optional... > > Regarding new userspce: > > > > Going forward, we can explicitly document that there should be no > > conflicting or "passenger" entries in a schemata write: don't include > > an entry for somehing that you don't explicitly want to set, and if > > multiple entries affect the same resource, we don't promise what > > happens. > > > > (But sadly, we can't impose that rule on existing software after the > > fact.) > > It may thus not be worth it to make such a rule. Ack. Perhaps we could recommend it, though. (At the very least, avoiding writing redundant entries would be a little more efficient for the user.) > > > > > > One final note: I have not provided any way to indicate that all those > > entries control the same hardware resource. The common "MB" prefix is > > intended as a clue, but ultimately, userspace needs to know what an > > entry controls before tweaking it. > > > > We could try to describe the relationships explicitly, but I'm not sure > > that it is useful... > > What other relationships should we consider for MPAM? I see that each > MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual, > ... ? Is it expected that MPAM's support of these should be exposed via resctrl? Probably not. 
These are best regarded as entirely separate instances of MPAM; the PARTID spaces are separate. The Non-secure physical address space is the only physical address space directly accessible to Linux -- for the others, we can't address the MMIO registers anyway. For now, the other address spaces are the firmware's problem. > Have you considered how to express if user wants hardware to have different > allocations for, for example, same PARTID at different execution levels? > > Reinette > > [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ MPAM doesn't allow different controls for a PARTID depending on the exception level, but it is possible to program different PARTIDs for hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0). I think that if we wanted to go down that road, we would want to expose additional "task IDs" in resctrlfs that can be placed into groups independently, say echo 14161:kernel >>.../some_group/tasks echo 14161:user >>.../other_group/tasks However, inside the kernel, the boundary between work done on behalf of a specific userspace task, work done on behalf of userspace in general, and autonomous work inside the kernel is fuzzy and not well defined. For this reason, we currently only configure the PARTID for EL0. For EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0). Hopefully this is orthogonal to the discussion of schema descriptions, though ...? Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-13 14:36 ` Dave Martin @ 2025-10-14 22:55 ` Reinette Chatre 2025-10-15 15:47 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-10-14 22:55 UTC (permalink / raw) To: Dave Martin Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 10/13/25 7:36 AM, Dave Martin wrote: > Hi Reinette, > > On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote: >> Hi Dave, >> >> On 9/30/25 8:40 AM, Dave Martin wrote: >>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: >>>> On 9/29/25 6:56 AM, Dave Martin wrote: > > [...] > >>>> 1) Commented schema are "inactive" >>>> This is unclear to me. In the MB example the commented lines show the >>>> finer grained controls. Since the original MB resource is an approximation >>>> and the hardware must already be configured to support it, would the #-prefixed >>>> lines not show the actual "active" configuration? >>> >>> They would show the active configuration (possibly more precisely than >>> "MB" does). >> >> That is how I see it also. This is specific to MB as we try to maintain >> backward compatibility. >> >> If we are going to make user interface changes to resource allocation then >> ideally it should consider all known future usage. I am trying to navigate >> and understand the discussion on how resctrl can support MPAM and this >> RDT region aware requirements. >> >> I scanned the MPAM spec and from what I understand a resource may support >> multiple controls at the same time, each with its own properties, and then >> there was this: >> >> When multiple partitioning controls are active, each affects the partition’s >> bandwidth usage. However, some combinations of controls may not make sense, >> because the regulation of that pair of controls cannot be made to work in concert. 
>> >> resctrl may thus present an "active configuration" that is not a configuration >> that "makes sense" ... this may be ok as resctrl would present what hardware >> supports combined with what user requested. > > This is analogous to what the MPAM spec says, though if resctrl offers > two different schemata for the same hardware control, the control cannot be > configured with both values simultaneously. > > For the MPAM hardware controls affecting the same hardware resource, > they can be programmed to combinations of values that have no sensible > interpretation, and the values can be read back just fine. The > performance effects may not be what the user expected / wanted, but > this is not directly visible to resctrl. > > So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user > can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that > will read back just as programmed. The architecture does not promise > what the performance effect of this will be, but resctrl does not need > to care. The same appears to be true for Intel RDT where the spec warns ("Undesirable and undefined performance effects may result if cap programming guidelines are not followed.") but does not seem to prevent such configurations. ... >>> This proposal is only about numeric controls. L3 wouldn't change, but >>> we could still add info/ metadata for bitmap control at the same time >>> as adding it for numeric controls. >> >> I think we should. At least we should leave space for such an addition since >> it is not obvious to me how multiple resources with different controls or >> single resource with multiple controls should be communicated to user space. >> >> To be specific, the original proposal [1] introduced a set of files for >> a numeric control and that seems to work for existing and upcoming >> schema that need a value in a range. 
Different controls need different >> parameters so to integrate this solution I think it needs another parameter >> (presented as a directory, a file, or within a file) that indicates the >> type of the control so that user space knows which files/parameters to expect >> and how to interpret them. > > Agreed. I wasn't meaning to imply that this proposal shouldn't be > integrated into something more general. If we want a richer > description than the current one, it makes sense to incorporate bitmap > controls -- this just wasn't my focus. Understood. > >> Since different controls have different parameters we need to consider >> whether it is easier to create/parse unique files for each control or >> present all the parameters within one file with another file noting the type >> of control. > > Separate files works quite well for low-tech tooling built using shell > scripts, and this seems to follow the sysfs philosophy. Since there is > no need to keep re-reading these parameters, simplicity feels more > important than efficiency? > > But we could equally have a single file with multiple pieces of > information in it. > > I don't have a strong view on this. If by sysfs philosophy you mean "one value per file" then resctrl split from that from the beginning (with the schemata file). I am also not advocating for one or the other at this time but believe we have some flexibility when faced with implementation options/challenges. > >> I understand the files/parameters are intended to be in the schema's info directory >> but how this will look is not obvious to me. Part of the MPAM refactoring transitioned >> the top level info directories to represent the schema entries that currently reflect >> the resources. When we start having multiple schema entries (multiple controls) for a >> single resource the simplest implementation may result in a top level info >> directory for every schema entry ... 
but the expectation is that these top >> level directories should be per resource, no? > > I had not considered that the info/ directories correspond to resources, > not individual schemata... >> >> At this time I am envisioning the proposal to result in something like below where >> there is one resource directory and one directory per schema entry with a (added by me) >> "schema_type" file to help user find out what the schema type is to know which files are present: >> >> MB >> ├── bandwidth_gran >> ├── delay_linear >> ├── MB >> │ ├── map >> │ ├── max >> │ ├── min >> │ ├── scale >> │ ├── schema_type >> │ └── unit >> ├── MB_HW >> │ ├── map >> │ ├── max >> │ ├── min >> │ ├── scale >> │ ├── schema_type >> │ └── unit >> ├── MB_MAX >> │ └── tbd >> ├── MB_MIN >> │ └── tbd >> ├── min_bandwidth >> ├── num_closids >> └── thread_throttle_mode > > I see no reason not to do that. Either way, older userspace just > ignores the new files and directories. > > Perhaps add an intermediate subdirectory to clarify the relationship > between the resource dir and the individual schema descriptions? > > This may also avoid the new descriptions getting mixed up with the old > description files. > > Say, > > info > ├── MB > │ ├── resource_schemata > │ │ ├── MB > │ │ │ ├── map > │ │ │ ├── max > │ ┆ │ ├── min > │ │ ┆ > ┆ │ > ├── MB_HW > │ ├── map > │ ┆ > ┆ Looks good to me. > >> Something else related to control that caught my eye in MPAM spec is this gem: >> MPAM provides discoverable vendor extensions to permit partners >> to invent partitioning controls. > > Yup. > > Since we have no way to know what vendor-specific controls look like or > what they mean, we can't do much about this. > > So, it's the vendor's job to implement support for it, and we might > still say no (if there is no sane way to integrate it). ack. 
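A sketch of how low-tech tooling might walk the layout discussed above; the directory names and schema_type contents are hypothetical (no kernel exposes such a tree yet), so the layout is mocked locally:

```shell
# Mock the proposed info/MB/resource_schemata layout locally;
# the schema_type values are made up for illustration.
mkdir -p info/MB/resource_schemata/MB info/MB/resource_schemata/MB_HW
echo percentage > info/MB/resource_schemata/MB/schema_type
echo hardware   > info/MB/resource_schemata/MB_HW/schema_type

# Enumerate each schema directory and report its type,
# printing one "name: type" line per schema entry.
for d in info/MB/resource_schemata/*/; do
    printf '%s: %s\n' "$(basename "$d")" "$(cat "$d/schema_type")"
done
```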
> >>> MB may be hard to describe in a useful way, though -- at least in the >>> MPAM case, where the number of steps does not divide into 100, and the >>> AMD cases where the meaning of the MB control values is different. >> >> Above I do assume that MB would be represented in a new interface since it >> is a schema entry, if that causes trouble then we could drop it. > > Since MB is described by the existing files and the documentation, > perhaps this it doesn't need an additional description. > > Alternatively though, could we just have a special schema_type for this, > and omit the other properties? This would mean that we at least have > an entry for every schema. We could do this, yes. >>> MB and MB_HW are not interchangeable. To obtain predictable results >>> from MB, userspace would need to know precisely how the kernel is going >>> to round the value. This feels like an implementation detail that >>> doesn't belong in the ABI. >> >> ack >> >> ... >> >>> Anyway, going back to the "#" convention: >>> >>> If the initial read of schemata has the new entries "pre-commented", >>> then userspace wouldn't need to know about the new entries. It could >>> just tweak the MB entry (which it knows about), and write the file back: >>> >>> MB: 0=43 >>> # MB_HW: 0=2 >>> # MB_MIN: 0=1 >>> # MB_MAX: 0=2 >>> >>> then resctrl knows to ignore the hashed lines, and so reading the file >>> back gives: >>> >>> MB: 0=43 >>> # MB_HW: 0=3 >>> # MB_MIN: 0=2 >>> # MB_MAX: 0=3 >> >> Thank you for the example. This seems reasonable. I would like to go back >> to what you wrote in [1]: >> >>> Software that understands the new entries can uncomment the conflicting >>> entries and write them back instead of (or in addition to) the >>> conflicting entries. 
For example, userspace might write the following: >>> >>> MB_MIN: 0=16, 1=16 >>> MB_MAX: 0=32, 1=32 >>> >>> Which might then read back as follows: >>> >>> MB: 0=50, 1=50 >>> # MB_HW: 0=32, 1=32 >>> # MB_MIN: 0=16, 1=16 >>> # MB_MAX: 0=32, 1=32 >> >> Could/should resctrl uncomment the lines after userspace modified them? > > The '#' wasn't meant to be a state that gets turned on and off. Thank you for clarifying. > Rather, userspace would use this to indicate which entries are > intentionally being modified. I see. I assume that we should not see many of these '#' entries and expect the ones we do see to shadow the legacy schemata entries. New schemata entries (that do not shadow legacy ones) should not have the '#' prefix even if their initial support does not include all controls. > So long as the entries affecting a single resource are ordered so that > each entry is strictly more specific than the previous entries (as > illustrated above), then reading schemata and stripping all the hashes > would allow a previous configuration to be restored; to change just one > entry, userspace can uncomment just that one, or write only that entry > (which is what I think we should recommend for new software). This is a good rule of thumb. > >>> (For hardware-specific reasons, the MPAM driver currently internally >>> programs the MIN bound to be a bit less than the MAX bound, when >>> userspace writes an "MB" entry into schemata. The key thing is that >>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the >>> resctrl level, I don't that that we necessarily need to make promises >>> about what they can change _to_. The exact effect of MIN and MAX >>> bounds is likely to be hardware-dependent anyway.) >> >> MPAM has the "HARDLIM" distinction associated with these MAX values >> and from what I can tell this is per PARTID. Is this something that needs >> to be supported? To do this resctrl will need to support modifying >> control properties per resource group. 
> > Possibly. Since this is a boolean control that determines how the > MBW_MAX control is applied, we could perhaps present it as an > additional schema -- if so, it's basically orthogonal. > > | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...] > > or > > | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...] > > Does this look reasonable? It does. > > I don't know whether we have a clear use case for this today, and like > almost everything else in MPAM, implementing it is optional... > > >>> Regarding new userspace: >>> >>> Going forward, we can explicitly document that there should be no >>> conflicting or "passenger" entries in a schemata write: don't include >>> an entry for something that you don't explicitly want to set, and if >>> multiple entries affect the same resource, we don't promise what >>> happens. >>> >>> (But sadly, we can't impose that rule on existing software after the >>> fact.) >> >> It may thus not be worth it to make such a rule. > > Ack. Perhaps we could recommend it, though. We could, yes. > > (At the very least, avoiding writing redundant entries would be a > little more efficient for the user.) > >>> >>> >>> One final note: I have not provided any way to indicate that all those >>> entries control the same hardware resource. The common "MB" prefix is >>> intended as a clue, but ultimately, userspace needs to know what an >>> entry controls before tweaking it. >>> >>> We could try to describe the relationships explicitly, but I'm not sure >>> that it is useful... >> >> What other relationships should we consider for MPAM? I see that each >> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual, >> ... ? Is it expected that MPAM's support of these should be exposed via resctrl? > > Probably not. These are best regarded as entirely separate instances > of MPAM; the PARTID spaces are separate.
The Non-secure physical > address space is the only physical address space directly accessible to > Linux -- for the others, we can't address the MMIO registers anyway. > > For now, the other address spaces are the firmware's problem. Thank you. > >> Have you considered how to express if user wants hardware to have different >> allocations for, for example, same PARTID at different execution levels? >> >> Reinette >> >> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ > > MPAM doesn't allow different controls for a PARTID depending on the > exception level, but it is possible to program different PARTIDs for > hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0). I misunderstood this from the spec. Thank you for clarifying. > > I think that if we wanted to go down that road, we would want to expose > additional "task IDs" in resctrlfs that can be placed into groups > independently, say > > echo 14161:kernel >>.../some_group/tasks > echo 14161:user >>.../other_group/tasks > > However, inside the kernel, the boundary between work done on behalf of > a specific userspace task, work done on behalf of userspace in general, > and autonomous work inside the kernel is fuzzy and not well defined. > > For this reason, we currently only configure the PARTID for EL0. For > EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0). > > Hopefully this is orthogonal to the discussion of schema descriptions, > though ...? Yes. Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-14 22:55 ` Reinette Chatre @ 2025-10-15 15:47 ` Dave Martin 2025-10-15 18:48 ` Luck, Tony 2025-10-16 16:31 ` Reinette Chatre 0 siblings, 2 replies; 52+ messages in thread From: Dave Martin @ 2025-10-15 15:47 UTC (permalink / raw) To: Reinette Chatre Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote: > Hi Dave, > > On 10/13/25 7:36 AM, Dave Martin wrote: > > Hi Reinette, > > > > On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote: > >> Hi Dave, > >> > >> On 9/30/25 8:40 AM, Dave Martin wrote: > >>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: > >>>> On 9/29/25 6:56 AM, Dave Martin wrote: > > > > [...] > > > >>>> 1) Commented schema are "inactive" > >>>> This is unclear to me. In the MB example the commented lines show the > >>>> finer grained controls. Since the original MB resource is an approximation > >>>> and the hardware must already be configured to support it, would the #-prefixed > >>>> lines not show the actual "active" configuration? > >>> > >>> They would show the active configuration (possibly more precisely than > >>> "MB" does). > >> > >> That is how I see it also. This is specific to MB as we try to maintain > >> backward compatibility. > >> > >> If we are going to make user interface changes to resource allocation then > >> ideally it should consider all known future usage. I am trying to navigate > >> and understand the discussion on how resctrl can support MPAM and the > >> RDT region-aware requirements.
> >> > >> I scanned the MPAM spec and from what I understand a resource may support > >> multiple controls at the same time, each with its own properties, and then > >> there was this: > >> > >> When multiple partitioning controls are active, each affects the partition’s > >> bandwidth usage. However, some combinations of controls may not make sense, > >> because the regulation of that pair of controls cannot be made to work in concert. > >> > >> resctrl may thus present an "active configuration" that is not a configuration > >> that "makes sense" ... this may be ok as resctrl would present what hardware > >> supports combined with what user requested. > > > > This is analogous to what the MPAM spec says, though if resctrl offers > > two different schemata for the same hardware control, the control cannot be > > configured with both values simultaneously. > > > > For the MPAM hardware controls affecting the same hardware resource, > > they can be programmed to combinations of values that have no sensible > > interpretation, and the values can be read back just fine. The > > performance effects may not be what the user expected / wanted, but > > this is not directly visible to resctrl. > > > > So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user > > can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that > > will read back just as programmed. The architecture does not promise > > what the performance effect of this will be, but resctrl does not need > > to care. > > The same appears to be true for Intel RDT where the spec warns ("Undesirable > and undefined performance effects may result if cap programming guidelines > are not followed.") but does not seem to prevent such configurations. Right. We _could_ block such a configuration from reaching the hardware, if the arch backend overrides the MIN limit when the MAX limit is written and vice-versa, when not doing so would result in crossed-over bounds.
If software wants to program both bounds, then that would be fine; e.g.:

# cat <<-EOF >/sys/fs/resctrl/schemata
MB_MAX: 0=128
EOF

# cat <<-EOF >/sys/fs/resctrl/schemata
MB_MIN: 0=256
MB_MAX: 0=1024
EOF

... internally programming some value >=256 before programming the hardware with the new min bound would not stop the final requested change to MB_MAX from working as userspace expected. (There will be inevitable regulation glitches unless the hardware provides a way to program both bounds atomically. MPAM doesn't; I don't think RDT does either?) But we only _need_ to do this if the hardware architecture forbids programming crossed bounds or says that it is unsafe to do so. So, I am thinking that the generic code doesn't need to handle this. [...] > >> To be specific, the original proposal [1] introduced a set of files for > >> a numeric control and that seems to work for existing and upcoming > >> schema that need a value in a range. Different controls need different > >> parameters so to integrate this solution I think it needs another parameter > >> (presented as a directory, a file, or within a file) that indicates the > >> type of the control so that user space knows which files/parameters to expect > >> and how to interpret them. > > > > Agreed. I wasn't meaning to imply that this proposal shouldn't be > > integrated into something more general. If we want a richer > > description than the current one, it makes sense to incorporate bitmap > > controls -- this just wasn't my focus. > > Understood. > > > > >> Since different controls have different parameters we need to consider > >> whether it is easier to create/parse unique files for each control or > >> present all the parameters within one file with another file noting the type > >> of control. > > > > Separate files work quite well for low-tech tooling built using shell > > scripts, and this seems to follow the sysfs philosophy.
Since there is > > no need to keep re-reading these parameters, simplicity feels more > > important than efficiency? > > > > But we could equally have a single file with multiple pieces of > > information in it. > > > > I don't have a strong view on this. > > If by sysfs philosophy you mean "one value per file" then resctrl departed from that from > the beginning (with the schemata file). I am also not advocating for one or the other > at this time but believe we have some flexibility when faced with implementation > options/challenges. Agreed -- it works either way. [...] > >> At this time I am envisioning the proposal to result in something like below where > >> there is one resource directory and one directory per schema entry with a (added by me) > >> "schema_type" file to help user find out what the schema type is to know which files are present:
> >>
> >> MB
> >> ├── bandwidth_gran
> >> ├── delay_linear
> >> ├── MB
> >> │ ├── map
> >> │ ├── max
> >> │ ├── min
> >> │ ├── scale
> >> │ ├── schema_type
> >> │ └── unit
> >> ├── MB_HW
> >> │ ├── map [...]
> >> ├── min_bandwidth
> >> ├── num_closids
> >> └── thread_throttle_mode
> >
> > I see no reason not to do that. Either way, older userspace just > > ignores the new files and directories. > > > > Perhaps add an intermediate subdirectory to clarify the relationship > > between the resource dir and the individual schema descriptions? > > > > This may also avoid the new descriptions getting mixed up with the old > > description files. > > > > Say,
> >
> > info
> > ├── MB
> > │ ├── resource_schemata
> > │ │ ├── MB
> > │ │ │ ├── map
> > │ │ │ ├── max
> > │ ┆ │ ├── min
> > │ │ ┆
> > ┆ │
> > ├── MB_HW
> > │ ├── map
> > │ ┆
> > ┆
>
> Looks good to me.

OK > >> Something else related to control that caught my eye in MPAM spec is this gem: > >> MPAM provides discoverable vendor extensions to permit partners > >> to invent partitioning controls. > > > > Yup.
> > > > Since we have no way to know what vendor-specific controls look like or > > what they mean, we can't do much about this. > > > > So, it's the vendor's job to implement support for it, and we might > > still say no (if there is no sane way to integrate it). > > ack. > > > > >>> MB may be hard to describe in a useful way, though -- at least in the > >>> MPAM case, where the number of steps does not divide into 100, and the > >>> AMD cases where the meaning of the MB control values is different. > >> > >> Above I do assume that MB would be represented in a new interface since it > >> is a schema entry, if that causes trouble then we could drop it. > > > > Since MB is described by the existing files and the documentation, > > perhaps it doesn't need an additional description. > > > > Alternatively though, could we just have a special schema_type for this, > > and omit the other properties? This would mean that we at least have > > an entry for every schema. > > We could do this, yes. I guess I'll go with this approach, then, and see if anyone objects. [...] > >>> MB: 0=50, 1=50 > >>> # MB_HW: 0=32, 1=32 > >>> # MB_MIN: 0=16, 1=16 > >>> # MB_MAX: 0=32, 1=32 > >> > >> Could/should resctrl uncomment the lines after userspace modified them? > > > > The '#' wasn't meant to be a state that gets turned on and off. > > Thank you for clarifying. > > > Rather, userspace would use this to indicate which entries are > > intentionally being modified. > > I see. I assume that we should not see many of these '#' entries and expect > the ones we do see to shadow the legacy schemata entries. New schemata entries > (that do not shadow legacy ones) should not have the '#' prefix even if > their initial support does not include all controls.
> > So long as the entries affecting a single resource are ordered so that > > each entry is strictly more specific than the previous entries (as > > illustrated above), then reading schemata and stripping all the hashes > > would allow a previous configuration to be restored; to change just one > > entry, userspace can uncomment just that one, or write only that entry > > (which is what I think we should recommend for new software). > > This is a good rule of thumb. To avoid printing entries in the wrong order, do we want to track some parent/child relationship between schemata? In the above example, * MB is the parent of MB_HW; * MB_HW is the parent of MB_MIN and MB_MAX. (for MPAM, at least). When schemata is read, parents should always be printed before their child schemata. But really, we just need to make sure that the rdt_schema_all list is correctly ordered. Do you think that this relationship needs to be reported to userspace? Since the "#" convention is for backward compatibility, maybe we should not use this for new schemata, and place the burden of managing conflicts onto userspace going forward. What do you think? > >>> (For hardware-specific reasons, the MPAM driver currently internally > >>> programs the MIN bound to be a bit less than the MAX bound, when > >>> userspace writes an "MB" entry into schemata. The key thing is that > >>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the > >>> resctrl level, I don't think that we necessarily need to make promises > >>> about what they can change _to_. The exact effect of MIN and MAX > >>> bounds is likely to be hardware-dependent anyway.) > >> > >> MPAM has the "HARDLIM" distinction associated with these MAX values > >> and from what I can tell this is per PARTID. Is this something that needs > >> to be supported? To do this resctrl will need to support modifying > >> control properties per resource group.
Since this is a boolean control that determines how the > > MBW_MAX control is applied, we could perhaps present it as an > > additional schema -- if so, it's basically orthogonal. > > > > | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...] > > > > or > > > > | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...] > > > > Does this look reasonable? > > It does. OK -- note, I don't think we have any immediate plan to support this in the MPAM driver, but it may land eventually in some form. [...] > >>> Regarding new userspace: > >>> > >>> Going forward, we can explicitly document that there should be no > >>> conflicting or "passenger" entries in a schemata write: don't include > >>> an entry for something that you don't explicitly want to set, and if > >>> multiple entries affect the same resource, we don't promise what > >>> happens. > >>> > >>> (But sadly, we can't impose that rule on existing software after the > >>> fact.) > >> > >> It may thus not be worth it to make such a rule. > > > > Ack. Perhaps we could recommend it, though. > > We could, yes. OK [...] > >> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual, > >> ... ? Is it expected that MPAM's support of these should be exposed via resctrl? > > > > Probably not. These are best regarded as entirely separate instances > > of MPAM; the PARTID spaces are separate.
> >> > >> Reinette > >> > >> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ > > > > MPAM doesn't allow different controls for a PARTID depending on the > > exception level, but it is possible to program different PARTIDs for > > hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0). > > I misunderstood this from the spec. Thank you for clarifying. > > > > > I think that if we wanted to go down that road, we would want to expose > > additional "task IDs" in resctrlfs that can be placed into groups > > independently, say > > > > echo 14161:kernel >>.../some_group/tasks > > echo 14161:user >>.../other_group/tasks > > > > However, inside the kernel, the boundary between work done on behalf of > > a specific userspace task, work done on behalf of userspace in general, > > and autonomous work inside the kernel is fuzzy and not well defined. > > > > For this reason, we currently only configure the PARTID for EL0. For > > EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0). > > > > Hopefully this is orthogonal to the discussion of schema descriptions, > > though ...? > > Yes. OK; I suggest that we put this on one side, for now, then. There is a discussion to be had on this, but it feels like a separate thing. I'll try to pull the state of this discussion together -- maybe as a draft update to the documentation, describing the interface as proposed so far. Does that work for you? Cheers --Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-15 15:47 ` Dave Martin @ 2025-10-15 18:48 ` Luck, Tony 2025-10-16 14:50 ` Dave Martin 2025-10-16 16:31 ` Reinette Chatre 1 sibling, 1 reply; 52+ messages in thread From: Luck, Tony @ 2025-10-15 18:48 UTC (permalink / raw) To: Dave Martin, Chatre, Reinette Cc: linux-kernel@vger.kernel.org, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86@kernel.org, linux-doc@vger.kernel.org > I'll try to pull the state of this discussion together -- maybe as a > draft update to the documentation, describing the interface as proposed > so far. Does that work for you? Dave, Yes please. This discussion has explored a bunch of options. A summary would be perfect. -Tony ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-15 18:48 ` Luck, Tony @ 2025-10-16 14:50 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-10-16 14:50 UTC (permalink / raw) To: Luck, Tony Cc: Chatre, Reinette, linux-kernel@vger.kernel.org, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86@kernel.org, linux-doc@vger.kernel.org Hi Tony, On Wed, Oct 15, 2025 at 06:48:55PM +0000, Luck, Tony wrote: > > I'll try to pull the state of this discussion together -- maybe as a > > draft update to the documentation, describing the interface as proposed > > so far. Does that work for you? > > Dave, > > Yes please. This discussion has explored a bunch of options. A summary > would be perfect. OK -- next week sometime, probably. I'll try to keep it high-level, since we don't have a final design yet... Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-15 15:47 ` Dave Martin 2025-10-15 18:48 ` Luck, Tony @ 2025-10-16 16:31 ` Reinette Chatre 2025-10-17 14:17 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-10-16 16:31 UTC (permalink / raw) To: Dave Martin Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 10/15/25 8:47 AM, Dave Martin wrote: > Hi Reinette, > > On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote: >> Hi Dave, >> >> On 10/13/25 7:36 AM, Dave Martin wrote: >>> Hi Reinette, >>> >>> On Fri, Oct 10, 2025 at 09:48:21AM -0700, Reinette Chatre wrote: >>>> Hi Dave, >>>> >>>> On 9/30/25 8:40 AM, Dave Martin wrote: >>>>> On Mon, Sep 29, 2025 at 09:09:35AM -0700, Reinette Chatre wrote: >>>>>> On 9/29/25 6:56 AM, Dave Martin wrote: >>> >>> [...] >>> >>>>>> 1) Commented schema are "inactive" >>>>>> This is unclear to me. In the MB example the commented lines show the >>>>>> finer grained controls. Since the original MB resource is an approximation >>>>>> and the hardware must already be configured to support it, would the #-prefixed >>>>>> lines not show the actual "active" configuration? >>>>> >>>>> They would show the active configuration (possibly more precisely than >>>>> "MB" does). >>>> >>>> That is how I see it also. This is specific to MB as we try to maintain >>>> backward compatibility. >>>> >>>> If we are going to make user interface changes to resource allocation then >>>> ideally it should consider all known future usage. I am trying to navigate >>>> and understand the discussion on how resctrl can support MPAM and the >>>> RDT region-aware requirements.
>>>> >>>> I scanned the MPAM spec and from what I understand a resource may support >>>> multiple controls at the same time, each with its own properties, and then >>>> there was this: >>>> >>>> When multiple partitioning controls are active, each affects the partition’s >>>> bandwidth usage. However, some combinations of controls may not make sense, >>>> because the regulation of that pair of controls cannot be made to work in concert. >>>> >>>> resctrl may thus present an "active configuration" that is not a configuration >>>> that "makes sense" ... this may be ok as resctrl would present what hardware >>>> supports combined with what user requested. >>> >>> This is analogous to what the MPAM spec says, though if resctrl offers >>> two different schemata for the same hardware control, the control cannot be >>> configured with both values simultaneously. >>> >>> For the MPAM hardware controls affecting the same hardware resource, >>> they can be programmed to combinations of values that have no sensible >>> interpretation, and the values can be read back just fine. The >>> performance effects may not be what the user expected / wanted, but >>> this is not directly visible to resctrl. >>> >>> So, if we offer independent schemata for MBW_MIN and MBW_MAX, the user >>> can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that >>> will read back just as programmed. The architecture does not promise >>> what the performance effect of this will be, but resctrl does not need >>> to care. >> >> The same appears to be true for Intel RDT where the spec warns ("Undesirable >> and undefined performance effects may result if cap programming guidelines >> are not followed.") but does not seem to prevent such configurations. > > Right. We _could_ block such a configuration from reaching the hardware, > if the arch backend overrides the MIN limit when the MAX limit is > written and vice-versa, when not doing so would result in crossed-over > bounds.
> > If software wants to program both bounds, then that would be fine; e.g.: > > # cat <<-EOF >/sys/fs/resctrl/schemata > MB_MAX: 0=128 > EOF > > # cat <<-EOF >/sys/fs/resctrl/schemata > MB_MIN: 0=256 > MB_MAX: 0=1024 > EOF > > ... internally programming some value >=256 before programming the > hardware with the new min bound would not stop the final requested > change to MB_MAX from working as userspace expected. > > (There will be inevitable regulation glitches unless the hardware > provides a way to program both bounds atomically. MPAM doesn't; I > don't think RDT does either?) > > > But we only _need_ to do this if the hardware architecture forbids > programming crossed bounds or says that it is unsafe to do so. So, I am > thinking that the generic code doesn't need to handle this. > > [...] Sounds reasonable to me. ... >>>>> MB: 0=50, 1=50 >>>>> # MB_HW: 0=32, 1=32 >>>>> # MB_MIN: 0=16, 1=16 >>>>> # MB_MAX: 0=32, 1=32 >>>> >>>> Could/should resctrl uncomment the lines after userspace modified them? >>> >>> The '#' wasn't meant to be a state that gets turned on and off. >> >> Thank you for clarifying. >> >>> Rather, userspace would use this to indicate which entries are >>> intentionally being modified. >> >> I see. I assume that we should not see many of these '#' entries and expect >> the ones we do see to shadow the legacy schemata entries. New schemata entries >> (that do not shadow legacy ones) should not have the '#' prefix even if >> their initial support does not include all controls. >>> So long as the entries affecting a single resource are ordered so that >>> each entry is strictly more specific than the previous entries (as >>> illustrated above), then reading schemata and stripping all the hashes >>> would allow a previous configuration to be restored; to change just one >>> entry, userspace can uncomment just that one, or write only that entry >>> (which is what I think we should recommend for new software). >> >> This is a good rule of thumb.
> > To avoid printing entries in the wrong order, do we want to track some > parent/child relationship between schemata? > > In the above example, > > * MB is the parent of MB_HW; > > * MB_HW is the parent of MB_MIN and MB_MAX. > > (for MPAM, at least). Could you please elaborate this relationship? I envisioned the MB_HW to be something similar to Intel RDT's "optimal" bandwidth setting ... something that is expected to be somewhere between the "min" and the "max". But, now I think I'm a bit lost in MPAM since it is not clear to me what MB_HW represents ... would this be the "memory bandwidth portion partitioning"? Although, that uses a completely different format from "min" and "max". > When schemata is read, parents should always be printed before their > child schemata. But really, we just need to make sure that the > rdt_schema_all list is correctly ordered. > > Do you think that this relationship needs to be reported to userspace? You brought up the topic of relationships in https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me to learn more from the MPAM spec where I learned and went on a tangent about all the other possible namespaces without circling back. I was hoping that the namespace prefix would make the relationships clear, something like <resource>_<control>, but I did not expect another layer in the hierarchy like your example above. The idea of "parent" and "child" is also not obvious to me at this point. resctrl gives us a "resource" to start with and we are now discussing multiple controls per resource. Could you please elaborate what you see as "parent" and "child"? We do have the info directory available to express relationships and a hierarchy is already starting to take shape there. > > Since the "#" convention is for backward compatibility, maybe we should > not use this for new schemata, and place the burden of managing > conflicts onto userspace going forward. What do you think? I agree.
The way I understand this is that the '#' will only be used for new controls that shadow the default/current controls of the legacy resources. I do not expect that the prefix will be needed for new resources, even if the initial support of a new resource does not include all possible controls. >>>>> (For hardware-specific reasons, the MPAM driver currently internally >>>>> programs the MIN bound to be a bit less than the MAX bound, when >>>>> userspace writes an "MB" entry into schemata. The key thing is that >>>>> writing MB may cause the MB_MIN/MB_MAX entries to change -- at the >>>>> resctrl level, I don't that that we necessarily need to make promises >>>>> about what they can change _to_. The exact effect of MIN and MAX >>>>> bounds is likely to be hardware-dependent anyway.) >>>> >>>> MPAM has the "HARDLIM" distinction associated with these MAX values >>>> and from what I can tell this is per PARTID. Is this something that needs >>>> to be supported? To do this resctrl will need to support modifying >>>> control properties per resource group. >>> >>> Possibly. Since this is a boolean control that determines how the >>> MBW_MAX control is applied, we could perhaps present it as an >>> additional schema -- if so, it's basically orthogonal. >>> >>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...] >>> >>> or >>> >>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...] >>> >>> Does this look reasonable? >> >> It does. > > OK -- note, I don't think we have any immediate plan to support this in > the MPAM driver, but it may land eventually in some form. > ack. ... >>>> MPAM allows per-PARTID configurations for secure/non-secure, physical/virtual, >>>> ... ? Is it expected that MPAM's support of these should be exposed via resctrl? >>> >>> Probably not. These are best regarded as entirely separate instances >>> of MPAM; the PARTID spaces are separate. 
The Non-secure physical >>> address space is the only physical address space directly accessible to >>> Linux -- for the others, we can't address the MMIO registers anyway. >>> >>> For now, the other address spaces are the firmware's problem. >> >> Thank you. > > No worries -- it's not too obvious from the spec! > >>>> Have you considered how to express if user wants hardware to have different >>>> allocations for, for example, same PARTID at different execution levels? >>>> >>>> Reinette >>>> >>>> [1] https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/ >>> >>> MPAM doesn't allow different controls for a PARTID depending on the >>> exception level, but it is possible to program different PARTIDs for >>> hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0). >> >> I misunderstood this from the spec. Thank you for clarifying. >> >>> >>> I think that if we wanted to go down that road, we would want to expose >>> additional "task IDs" in resctrlfs that can be placed into groups >>> independently, say >>> >>> echo 14161:kernel >>.../some_group/tasks >>> echo 14161:user >>.../other_group/tasks >>> >>> However, inside the kernel, the boundary between work done on behalf of >>> a specific userspace task, work done on behalf of userspace in general, >>> and autonomous work inside the kernel is fuzzy and not well defined. >>> >>> For this reason, we currently only configure the PARTID for EL0. For >>> EL1 (and EL2 if the kernel uses it), we just use the default PARTID (0). >>> >>> Hopefully this is orthogonal to the discussion of schema descriptions, >>> though ...? >> >> Yes. > > OK; I suggest that we put this on one side, for now, then. > > There is a discussion to be had on this, but it feels like a separate > thing. agreed. > > > I'll try to pull the state of this discussion together -- maybe as a > draft update to the documentation, describing the interface as proposed > so far. Does that work for you? It does. Thank you very much for taking this on. 
Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-16 16:31 ` Reinette Chatre @ 2025-10-17 14:17 ` Dave Martin 2025-10-17 15:59 ` Reinette Chatre 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-10-17 14:17 UTC (permalink / raw) To: Reinette Chatre Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 10/15/25 8:47 AM, Dave Martin wrote: > > Hi Reinette, > > > > On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote: > >> Hi Dave, > >> > >> On 10/13/25 7:36 AM, Dave Martin wrote: [...] > >>> [...] if we offer independent schemata for MBW_MIN and MBW_MAX, the user > >>> can program MBW_MIN=75% and MBW_MAX=25% for the same PARTID, and that > >>> will read back just as programmed. The architecture does not promise > >>> what the performance effect of this will be, but resctrl does not need > >>> to care. > >> > >> The same appears to be true for Intel RDT where the spec warns ("Undesirable > >> and undefined performance effects may result if cap programming guidelines > >> are not followed.") but does not seem to prevent such configurations. > > > > Right. We _could_ block such a configuration from reaching the hardware, > > if the arch backend overrides the MIN limit when the MAX limit is > > written and vice-versa, when not doing so would result in crossed-over > > bounds. > > > > If software wants to program both bounds, then that would be fine; e.g.: > > > > # cat <<-EOF >/sys/fs/resctrl/schemata > > MB_MAX: 0=128 > > EOF > > > > # cat <<-EOF >/sys/fs/resctrl/schemata > > MB_MIN: 0=256 > > MB_MAX: 0=1024 > > EOF > > > > ... internally programming some value >=256 before programming the > > hardware with the new min bound would not stop the final requested > > change to MB_MAX from working as userspace expected.
> > > > (There will be inevitable regulation glitches unless the hardware > > provides a way to program both bounds atomically. MPAM doesn't; I > > don't think RDT does either?) > > > > > > But we only _need_ to do this if the hardware architecture forbids > > programming cross bounds or says that it is unsafe to do so. So, I am > > thinking that the generic code doesn't need to handle this. > > > > [...] > > Sounds reasonable to me. OK [...] > >>> So long as the entries affecting a single resource are ordered so that > >>> each entry is strictly more specific than the previous entries (as > >>> illustrated above), then reading schemata and stripping all the hashes > >>> would allow a previous configuration to be restored; to change just one > >>> entry, userspace can uncomment just that one, or write only that entry > >>> (which is what I think we should recommend for new software). > >> > >> This is a good rule of thumb. > > > > To avoid printing entries in the wrong order, do we want to track some > > parent/child relationship between schemata. > > > > In the above example, > > > > * MB is the parent of MB_HW; > > > > * MB_HW is the parent of MB_MIN and MB_MAX. > > > > (for MPAM, at least). > > Could you please elaborate this relationship? I envisioned the MB_HW to be > something similar to Intel RDT's "optimal" bandwidth setting ... something > that is expected to be somewhere between the "min" and the "max". > > But, now I think I'm a bit lost in MPAM since it is not clear to me what > MB_HW represents ... would this be the "memory bandwidth portion > partitioning"? Although, that uses a completely different format from > "min" and "max". I confess that I'm thinking with an MPAM mindset here. 
Some pseudocode might help to illustrate how these might interact:

set_MB(partid, val) {
	set_MB_HW(partid, percent_to_hw_val(val));
}

set_MB_HW(partid, val) {
	set_MB_MAX(partid, val);

	/*
	 * Hysteresis to avoid steady flows from ping-ponging
	 * between low and high priority:
	 */
	if (hardware_has_MB_MIN())
		set_MB_MIN(partid, val * 95%);
}

set_MB_MIN(partid, val) { mpam->MBW_MIN[partid] = val; }

set_MB_MAX(partid, val) { mpam->MBW_MAX[partid] = val; }

with

get_MB(partid) {
	return hw_val_to_percent(get_MB_HW(partid));
}

get_MB_HW(partid) { return get_MB_MAX(partid); }

get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; }

get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; }

The parent/child relationship I suggested is basically the call-graph of this pseudocode. These could all be exposed as resctrl schemata, but the children provide finer / more broken-down control than the parents. Reading a parent provides a merged or approximated view of the configuration of the child schemata.

In particular,

	set_child(partid, get_child(partid));
	get_parent(partid);

yields the same result as

	get_parent(partid);

but this will not be true in general if the roles of parent and child are reversed.

I still think this holds true if implementing an "MB_HW" schema for newer revisions of RDT. The pseudocode would be different, but there will still be a tree-like call graph (?)

Going back to MPAM:

Re MPAM memory bandwidth portion partitioning (a.k.a. MBW_PART or MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory bandwidth is split into discrete, non-overlapping chunks, and each PARTID is configured with a bitmap saying which chunks it can use. This could be done by time-slicing, or by controlling which memory controllers/ports a PARTID can issue requests to, or something like that.
If the MBW_MAX control isn't implemented, then the current MPAM driver maps this bitmap control onto the resctrl "MB" schema in a simple way, but we are considering dropping this, since the allocation model (explicit, static allocation of discrete resources) is not really the same as for RDT MBA (dynamic prioritisation based on recent resource consumption). Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend on an equal footing for memory bandwidth until one exceeds 50% (when it will start to be penalised). Programming bitmaps can't have the same effect. For example, with { 1100, 0110, 0011, 1001 }, no group can use more than 50% of the full bandwidth, whatever happens. Worse, certain pairs of groups are fully isolated from each other, while others are always in contention, no matter how little actual traffic is generated. This is potentially useful, but it's not the same as the MIN/MAX model. So, it may make more sense to expose this as a separate, bitmap schema. (The same goes for "Proportional stride" partitioning. It's another, different, control for memory bandwidth. As of today, I don't think that we have a reference platform for experimenting with either of these.) > > When schemata is read, parents should always be printed before their > > child schemata. But really, we just need to make sure that the > > rdt_schema_all list is correctly ordered. > > > > > > Do you think that this relationship needs to be reported to userspace? > > You brought up the topic of relationships in > https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me > to learn more from the MPAM spec where I learned and went on a tangent about all > the other possible namespaces without circling back. > > I was hoping that the namespace prefix would make the relationships clear, > something like <resource>_<control>, but I did not expect another layer in > the hierarchy like your example above.
The idea of "parent" and "child" is > also not obvious to me at this point. resctrl gives us a "resource" to start > with and we are now discussing multiple controls per resource. Could you please > elaborate what you see as "parent" and "child"? See above -- the parent/child concept is not an MPAM thing; apologies if I didn't make that clear. > We do have the info directory available to express relationships and a > hierarchy is already starting to take shape there. I'm wondering whether using a common prefix will be future-proof? It may not always be clear which part of a name counts as the common prefix. There were already discussions about appending a number to a schema name in order to control different memory regions -- that's another prefix/suffix relationship, if so... We could handle all of this by documenting all the relationships explicitly. But I'm thinking that it could be easier for maintenance if the resctrl core code has explicit knowledge of the relationships. That said, using a common prefix is still a good idea. But maybe we shouldn't lean on it too heavily as a way of actually describing the relationships? > > Since the "#" convention is for backward compatibility, maybe we should > > not use this for new schemata, and place the burden of managing > > conflicts onto userspace going forward. What do you think? > > I agree. The way I understand this is that the '#' will only be used for > new controls that shadow the default/current controls of the legacy resources. > I do not expect that the prefix will be needed for new resources, even if > the initial support of a new resource does not include all possible controls. OK. Note, relating this to the above, the # could be interpreted as meaning "this is a child of some other schema; don't mess with it unless you know what you are doing". Older software doesn't understand the relationships, so this is just there to stop it from shooting itself in the foot. [...]
> >>>> MPAM has the "HARDLIM" distinction associated with these MAX values > >>>> and from what I can tell this is per PARTID. Is this something that needs > >>>> to be supported? To do this resctrl will need to support modifying > >>>> control properties per resource group. > >>> > >>> Possibly. Since this is a boolean control that determines how the > >>> MBW_MAX control is applied, we could perhaps present it as an > >>> additional schema -- if so, it's basically orthogonal. > >>> > >>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...] > >>> > >>> or > >>> > >>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...] > >>> > >>> Does this look reasonable? > >> > >> It does. > > > > OK -- note, I don't think we have any immediate plan to support this in > > the MPAM driver, but it may land eventually in some form. > > > > ack. (Or, of course, anything else that achieves the same goal...) [...] > >>> MPAM doesn't allow different controls for a PARTID depending on the > >>> exception level, but it is possible to program different PARTIDs for > >>> hypervisor/kernel and userspace (i.e., EL2/EL1 and EL0). [...] > >>> Hopefully this is orthogonal to the discussion of schema descriptions, > >>> though ...? > >> > >> Yes. > > > > OK; I suggest that we put this on one side, for now, then. > > > > There is a discussion to be had on this, but it feels like a separate > > thing. > > agreed. > > > > > > > I'll try to pull the state of this discussion together -- maybe as a > > draft update to the documentation, describing the interface as proposed > > so far. Does that work for you? > > It does. Thank you very much for taking this on. > > Reinette OK, I'll aim to follow up on this next week. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-17 14:17 ` Dave Martin @ 2025-10-17 15:59 ` Reinette Chatre 2025-10-20 15:50 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-10-17 15:59 UTC (permalink / raw) To: Dave Martin Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On 10/17/25 7:17 AM, Dave Martin wrote: > Hi Reinette, > > On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote: >> Hi Dave, >> >> On 10/15/25 8:47 AM, Dave Martin wrote: >>> Hi Reinette, >>> >>> On Tue, Oct 14, 2025 at 03:55:40PM -0700, Reinette Chatre wrote: >>>> Hi Dave, >>>> >>>> On 10/13/25 7:36 AM, Dave Martin wrote: ... >>>>> So long as the entries affecting a single resource are ordered so that >>>>> each entry is strictly more specific than the previous entries (as >>>>> illustrated above), then reading schemata and stripping all the hashes >>>>> would allow a previous configuration to be restored; to change just one >>>>> entry, userspace can uncomment just that one, or write only that entry >>>>> (which is what I think we should recommend for new software). >>>> >>>> This is a good rule of thumb. >>> >>> To avoid printing entries in the wrong order, do we want to track some >>> parent/child relationship between schemata. >>> >>> In the above example, >>> >>> * MB is the parent of MB_HW; >>> >>> * MB_HW is the parent of MB_MIN and MB_MAX. >>> >>> (for MPAM, at least). >> >> Could you please elaborate this relationship? I envisioned the MB_HW to be >> something similar to Intel RDT's "optimal" bandwidth setting ... something >> that is expected to be somewhere between the "min" and the "max". >> >> But, now I think I'm a bit lost in MPAM since it is not clear to me what >> MB_HW represents ... would this be the "memory bandwidth portion >> partitioning"? 
Although, that uses a completely different format from >> "min" and "max". > > I confess that I'm thinking with an MPAM mindset here. > > Some pseudocode might help to illustrate how these might interact: > > set_MB(partid, val) { > set_MB_HW(partid, percent_to_hw_val(val)); > } > > set_MB_HW(partid, val) { > set_MB_MAX(partid, val); > > /* > * Hysteresis to avoid steady flows from ping-ponging > * between low and high priority: > */ > if (hardware_has_MB_MIN()) > set_MB_MIN(partid, val * 95%); > } > > set_MB_MIN(partid, val) { > mpam->MBW_MIN[partid] = val; > } > > set_MB_MAX(partid, val) { > mpam->MBW_MAX[partid] = val; > } > > with > > get_MB(partid) { > return hw_val_to_percent(get_MB_HW(partid)); > } > > get_MB_HW(partid) { return get_MB_MAX(partid); } > > get_MB_MIN(partid) { return mpam->MBW_MIN[partid]; } > > get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; } > > > The parent/child relationship I suggested is basically the call-graph > of this pseudocode. These could all be exposed as resctrl schemata, > but the children provide finer / more broken-down control than the > parents. Reading a parent provides a merged or approximated view of > the configuration of the child schemata. > > In particular, > > set_child(partid, get_child(partid)); > get_parent(partid); > > yields the same result as > > get_parent(partid); > > but will not be true in general, if the roles of parent and child are > reversed. > > I think still this holds true if implementing an "MB_HW" schema for > newer revisions of RDT. The pseudocode would be different, but there > will still be a tree-like call graph (?) Thank you very much for the example. I missed in earlier examples that MB_HW was being controlled via MB_MAX and MB_MIN. I do not expect such a dependence or tree-like call graph for RDT where the closest equivalent (termed "optimal") is programmed independently from min and max. 
> > > Going back to MPAM: > > Re MPAM memory bandwidth portion partitioning (a.k.a., MBW_PART or > MBWPBM), this is a bitmap-type control, analogous to RDT CAT: memory > bandwidth is split into discrete, non-overlapping chunks, and each > PARTID is configured with a bitmap saying which chunks it can use. > This could be done by time-slicing, or controlling which memory > controllers/ports a PARTID can issue requests to, or something like > that. > > If the MBW_MAX control isn't implemented, then the current MPAM driver > maps this bitmap control onto the resctrl "MB" schema in a simple way, > but we are considering dropping this, since the allocation model > (explicit, static allocation of discrete resources) is not really the > same as for RDT MBA (dynamic prioritisation based on recent resource > consumption). > > Programming MBW_MAX=50% for four PARTIDs means that the PARTIDs contend > on an equal footing for memory bandwidth until one exceeds 50% (when it > will start to be penalised). Programming bitmaps can't have the same > effect. For example, with { 1100, 0110, 0011, 1001 }, no group can use > more than 50% of the full bandwidth, whatever happens. Worse, certain > pairs of groups are fully isolated from each other, while others are > always in contention, no matter how little actual traffic is generated. > This is potentially useful, but it's not the same as the MIN/MAX model. > > So, it may make more sense to expose this as a separate, bitmap schema. > > (The same goes for "Proportional stride" partitioning. It's another, > different, control for memory bandwidth. As of today, I don't think > that we have a reference platform for experimenting with either of > these.) Thank you.
>> >> You brought up the topic of relationships in >> https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ that prompted me >> to learn more from the MPAM spec where I learned and went on a tangent about all >> the other possible namespaces without circling back. >> >> I was hoping that the namespace prefix would make the relationships clear, >> something like <resource>_<control>, but I did not expect another layer in >> the hierarchy like your example above. The idea of "parent" and "child" is >> also not obvious to me at this point. resctrl gives us a "resource" to start >> with and we are now discussing multiple controls per resource. Could you please >> elaborate what you see as "parent" and "child"? > > See above -- the parent/child concept is not an MPAM thing; apologies > if I didn't make that clear. > >> We do have the info directory available to express relationships and a >> hierarchy is already starting to take shape there. > > I'm wondering whether using a common prefix will be future-proof? It > may not always be clear which part of a name counts as the common > prefix. Apologies for my cryptic response. I was actually musing that we already discussed using the info directory to express relationships between controls and resources and it does not seem a big leap to expand this to express relationships between controls. Consider something like below for MPAM:

info
└── MB
    └── resource_schemata
        └── MB
            └── MB_HW
                ├── MB_MAX
                └── MB_MIN

On RDT it may then look different:

info
└── MB
    └── resource_schemata
        └── MB
            ├── MB_HW
            ├── MB_MAX
            └── MB_MIN

Having the resource name as a common prefix does seem consistent and makes clear to user space which controls apply to a resource.
But I'm thinking that it could be easier for maintenance > if the resctrl core code has explicit knowledge of the relationships. Not just for resctrl itself but to make clear to user space which controls impact others and which are independent. > That said, using a common prefix is still a good idea. But maybe we > shouldn't lean on it too heavily as a way of actually describing the > relationships? I do not think we can rely on the order in the schemata file though. For example, I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?). In either case, the schemata may print something like below on both platforms (copied from your original example) where for MPAM it implies a relationship but for RDT it does not:

MB: 0=50, 1=50
# MB_HW: 0=32, 1=32
# MB_MIN: 0=31, 1=31
# MB_MAX: 0=32, 1=32

>>> Since the "#" convention is for backward compatibility, maybe we should >>> not use this for new schemata, and place the burden of managing >>> conflicts onto userspace going forward. What do you think? >> >> I agree. The way I understand this is that the '#' will only be used for >> new controls that shadow the default/current controls of the legacy resources. >> I do not expect that the prefix will be needed for new resources, even if >> the initial support of a new resource does not include all possible controls. > > OK. Note, relating this to the above, the # could be interpreted as > meaning "this is a child of some other schema; don't mess with it > unless you know what you are doing". Could it be made more specific to be "this is a child of a legacy schema created before this new format existed; don't mess with it unless you know what you are doing"? That is, any schema created after this new format is established does not need the '#' prefix even if there is a parent/child relationship?
> > Older software doesn't understand the relationships, so this is just > there to stop it from shooting itself in the foot. ack. By extension I assume that software that understands a schema that is introduced after the "relationship" format is established can be expected to understand the format and thus these new schemata do not require the '#' prefix. Even if a new schema is introduced with a single control it can be followed by a new child control without a '#' prefix a couple of kernel releases later. By this point it should hopefully be understood by user space that it should not write entries it does not understand. > > [...] > >>>>>> MPAM has the "HARDLIM" distinction associated with these MAX values >>>>>> and from what I can tell this is per PARTID. Is this something that needs >>>>>> to be supported? To do this resctrl will need to support modifying >>>>>> control properties per resource group. >>>>> >>>>> Possibly. Since this is a boolean control that determines how the >>>>> MBW_MAX control is applied, we could perhaps present it as an >>>>> additional schema -- if so, it's basically orthogonal. >>>>> >>>>> | MB_HARDMAX: 0=0, 1=1, 2=1, 3=0 [...] >>>>> >>>>> or >>>>> >>>>> | MB_HARDMAX: 0=off, 1=on, 2=on, 3=off [...] >>>>> >>>>> Does this look reasonable? >>>> >>>> It does. >>> >>> OK -- note, I don't think we have any immediate plan to support this in >>> the MPAM driver, but it may land eventually in some form. >>> >> >> ack. > > (Or, of course, anything else that achieves the same goal...) Right ... I did not dig into syntax that could be made to match existing schema formats etc. that can be filled in later. ... >>> I'll try to pull the state of this discussion together -- maybe as a >>> draft update to the documentation, describing the interface as proposed >>> so far. Does that work for you? >> >> It does. Thank you very much for taking this on. >> >> Reinette > > OK, I'll aim to follow up on this next week. Thank you very much. 
Reinette ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-17 15:59 ` Reinette Chatre @ 2025-10-20 15:50 ` Dave Martin 2025-10-20 16:31 ` Luck, Tony 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-10-20 15:50 UTC (permalink / raw) To: Reinette Chatre Cc: Luck, Tony, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote: > Hi Dave, > > On 10/17/25 7:17 AM, Dave Martin wrote: > > Hi Reinette, > > > > On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote: > >> Hi Dave, > >> > >> On 10/15/25 8:47 AM, Dave Martin wrote: [...] > >>> To avoid printing entries in the wrong order, do we want to track some > >>> parent/child relationship between schemata. > >>> > >>> In the above example, > >>> > >>> * MB is the parent of MB_HW; > >>> > >>> * MB_HW is the parent of MB_MIN and MB_MAX. > >>> > >>> (for MPAM, at least). > >> > >> Could you please elaborate this relationship? I envisioned the MB_HW to be > >> something similar to Intel RDT's "optimal" bandwidth setting ... something > >> that is expected to be somewhere between the "min" and the "max". > >> > >> But, now I think I'm a bit lost in MPAM since it is not clear to me what > >> MB_HW represents ... would this be the "memory bandwidth portion > >> partitioning"? Although, that uses a completely different format from > >> "min" and "max". > > > > I confess that I'm thinking with an MPAM mindset here. > > > > Some pseudocode might help to illustrate how these might interact: > > > > set_MB(partid, val) { > > set_MB_HW(partid, percent_to_hw_val(val)); [...] > > get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; } > > > > > > The parent/child relationship I suggested is basically the call-graph > > of this pseudocode. 
These could all be exposed as resctrl schemata, > > but the children provide finer / more broken-down control than the > > parents. Reading a parent provides a merged or approximated view of > > the configuration of the child schemata. > > > > In particular, > > > > set_child(partid, get_child(partid)); > > get_parent(partid); > > > > yields the same result as > > > > get_parent(partid); > > > > but will not be true in general, if the roles of parent and child are > > reversed. > > > > I still think this holds true if implementing an "MB_HW" schema for > > newer revisions of RDT. The pseudocode would be different, but there > > will still be a tree-like call graph (?) > > Thank you very much for the example. I missed in earlier examples that > MB_HW was being controlled via MB_MAX and MB_MIN. > I do not expect such a dependence or tree-like call graph for RDT where > the closest equivalent (termed "optimal") is programmed independently from > min and max. I hadn't realised that this RDT feature has three control thresholds. I'll comment in more detail on your sample info/ hierarchy, below. > > > > Going back to MPAM: [...] > > So, it may make more sense to expose [MBWPBM] as a separate, bitmap schema. > > > > (The same goes for "Proportional stride" partitioning. It's another, > > different, control for memory bandwidth. As of today, I don't think > > that we have a reference platform for experimenting with either of > > these.) > > Thank you. > > > > > > >>> When schemata is read, parents should always be printed before their > >>> child schemata. But really, we just need to make sure that the > >>> rdt_schema_all list is correctly ordered. > >>> > >>> > >>> Do you think that this relationship needs to be reported to userspace? [...] > >> We do have the info directory available to express relationships and a > >> hierarchy is already starting to take shape there. > > > > I'm wondering whether using a common prefix will be future-proof?
It > > may not always be clear which part of a name counts as the common > > prefix. > > Apologies for my cryptic response. I was actually musing that we already > discussed using the info directory to express relationships between > controls and resources and it does not seem a big leap to expand > this to express relationships between controls. Consider something > like below for MPAM: > > info > └── MB > └── resource_schemata > └── MB > └── MB_HW > ├── MB_MAX > └── MB_MIN > > > On RDT it may then look different: > > info > └── MB > └── resource_schemata > └── MB > ├── MB_HW > ├── MB_MAX > └── MB_MIN > > Having the resource name as common prefix does seem consistent and makes > clear to user space which controls apply to a resource. Ack. The above hierarchies make sense, but I wonder whether we should be forcing software to understand the MIN and MAX limits? I can still see a benefit in having MB_HW be a generic, software- defined control, even on RDT. Then, this can always be available, with similar behaviour, on all resctrl instances that support memory bandwidth controls. The precise set of child controls will vary per arch (and on MPAM at least, between different hardware implementations) -- so these look like they will work less well as a generic interface. Considering RDT: to avoid random regulation behaviour, RDT says that you need MIN <= OPT <= MAX, so a generic "MB_HW" control that does not require software to understand the individual MIN, OPT and MAX thresholds would still need to program all of these under the hood so as to avoid an invalid combination being set in the hardware. If I have understood the definition of the MARC table correctly, then there is a separate flag to report the presence of each of MIN, MAX and OPT, so software _might_ be expected to use a random subset of them(?) (If so, that's somewhat like the MPAM situation.) So, I wonder whether we could actually have the following on RDT? 
info
├── MB
┆   └── resource_schemata
        ├── MB
        ┆   └── MB_HW
                ├── MB_MAX
                ├── MB_MIN
                └── MB_OPT

If MB_HW is programmed by software, then MB_MAX, MB_OPT and MB_MIN would be programmed with some reasonable default spread (or possibly, all with the same value). That way, software that wants independent control over MIN, OPT and MAX can have it (and sweat the problem of dealing with hardware where they aren't all implemented -- if that's a thing). But software that doesn't need this fine control gets a single MB_HW knob that is more-or-less portable between platforms. Does that make sense, or is it an abstraction too far? (Going one step further, maybe we can actually put MPAM and RDT together with a 3-threshold model. For MPAM, we could possibly express the HARDLIM option using the extra threshold... that probably needs a bit more thought, though.) > > There were already discussions about appending a number to a schema > > name in order to control different memory regions -- that's another > > prefix/suffix relationship, if so... > > > > We could handle all of this by documenting all the relationships > > explicitly. But I'm thinking that it could be easier for maintenance > > if the resctrl core code has explicit knowledge of the relationships. > > Not just for resctrl itself but to make clear to user space which > controls impact others and which are independent. > > That said, using a common prefix is still a good idea. But maybe we > > shouldn't lean on it too heavily as a way of actually describing the > > relationships? > I do not think we can rely on the order in the schemata file though. For example, > I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to > also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?)
in either > case the schemata may print something like below on both platforms (copied from > your original example) where for MPAM it implies a relationship but for RDT it > does not: >
> MB: 0=50, 1=50
> # MB_HW: 0=32, 1=32
> # MB_MIN: 0=31, 1=31
> # MB_MAX: 0=32, 1=32

This still DTRT though? If MB_HW maps into the "optimal bandwidth" control on RDT, then it is still safe to program it first, before MB_{MIN,MAX}. The contents of the schemata file won't be sufficient to figure out the relationships, but that wasn't my intention. We have info/ for that. Instead, the schemata file just needs to be ordered in a way that is compatible with those relationships, so that one line does not unintentionally clobber the effect of a preceding line. My concern was that if we rely totally on manual maintenance to keep the schemata file in a compatible order, we'll probably get that wrong sooner or later... > >>> Since the "#" convention is for backward compatibility, maybe we should > >>> not use this for new schemata, and place the burden of managing > >>> conflicts onto userspace going forward. What do you think? > >> > >> I agree. The way I understand this is that the '#' will only be used for > >> new controls that shadow the default/current controls of the legacy resources. > >> I do not expect that the prefix will be needed for new resources, even if > >> the initial support of a new resource does not include all possible controls. > > > > OK. Note, relating this to the above, the # could be interpreted as > > meaning "this is a child of some other schema; don't mess with it > > unless you know what you are doing". > > Could it be made more specific to be "this is a child of a legacy schema created > before this new format existed; don't mess with it unless you know what you are > doing"? > That is, any schema created after this new format is established does not need > the '#' prefix even if there is a parent/child relationship? Yes, I think so.
Except: if some schema is advertised and documented with no children, then is it reasonable for software to assume that it will never have children? I think that the answer is probably "yes", in which case would it make sense to # any schema that is a child of some schema that did not have children in some previous upstream kernel? > > > > Older software doesn't understand the relationships, so this is just > > there to stop it from shooting itself in the foot. > > ack. > > By extension I assume that software that understands a schema that is introduced > after the "relationship" format is established can be expected to understand the > format and thus these new schemata do not require the '#' prefix. Even if > a new schema is introduced with a single control it can be followed by a new child > control without a '#' prefix a couple of kernel releases later. By this point it > should hopefully be understood by user space that it should not write entries it does > not understand. Generally, yes. I think that boils down to: "OK, previously you could just tweak bits of the whole schemata file you read and write the whole thing back, and the effect would be what you intuitively expected. But in future different schemata in the file may not be independent of one another. We'll warn you which things might not be independent, but we may not describe exactly how they affect each other. "So, from now on, only write the things that you actually want to set." Does that sound about right? [...] > >>> > >>> OK -- note, I don't think we have any immediate plan to support [HARDLIM] in > >>> the MPAM driver, but it may land eventually in some form. > >>> > >> > >> ack. > > > > (Or, of course, anything else that achieves the same goal...) > > Right ... I did not dig into syntax that could be made to match existing > schema formats etc. that can be filled in later. Ack > ...
> > >>> I'll try to pull the state of this discussion together -- maybe as a > >>> draft update to the documentation, describing the interface as proposed > >>> so far. Does that work for you? > >> > >> It does. Thank you very much for taking this on. > >> > >> Reinette > > > > OK, I'll aim to follow up on this next week. > > Thank you very much. > > Reinette Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-20 15:50 ` Dave Martin @ 2025-10-20 16:31 ` Luck, Tony 2025-10-21 14:37 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Luck, Tony @ 2025-10-20 16:31 UTC (permalink / raw) To: Dave Martin Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote: > Hi Reinette, > > On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote: > > Hi Dave, > > > > On 10/17/25 7:17 AM, Dave Martin wrote: > > > Hi Reinette, > > > > > > On Thu, Oct 16, 2025 at 09:31:45AM -0700, Reinette Chatre wrote: > > >> Hi Dave, > > >> > > >> On 10/15/25 8:47 AM, Dave Martin wrote: > > [...] > > > >>> To avoid printing entries in the wrong order, do we want to track some > > >>> parent/child relationship between schemata. > > >>> > > >>> In the above example, > > >>> > > >>> * MB is the parent of MB_HW; > > >>> > > >>> * MB_HW is the parent of MB_MIN and MB_MAX. > > >>> > > >>> (for MPAM, at least). > > >> > > >> Could you please elaborate this relationship? I envisioned the MB_HW to be > > >> something similar to Intel RDT's "optimal" bandwidth setting ... something > > >> that is expected to be somewhere between the "min" and the "max". > > >> > > >> But, now I think I'm a bit lost in MPAM since it is not clear to me what > > >> MB_HW represents ... would this be the "memory bandwidth portion > > >> partitioning"? Although, that uses a completely different format from > > >> "min" and "max". > > > > > > I confess that I'm thinking with an MPAM mindset here. > > > > > > Some pseudocode might help to illustrate how these might interact: > > > > > > set_MB(partid, val) { > > > set_MB_HW(partid, percent_to_hw_val(val)); > > [...] 
> > > > get_MB_MAX(partid) { return mpam->MBW_MAX[partid]; } > > > > > > > > > The parent/child relationship I suggested is basically the call-graph > > > of this pseudocode. These could all be exposed as resctrl schemata, > > > but the children provide finer / more broken-down control than the > > > parents. Reading a parent provides a merged or approximated view of > > > the configuration of the child schemata. > > > > > > In particular, > > > > > > set_child(partid, get_child(partid)); > > > get_parent(partid); > > > > > > yields the same result as > > > > > > get_parent(partid); > > > > > > but will not be true in general, if the roles of parent and child are > > > reversed. > > > > > > I think this still holds true if implementing an "MB_HW" schema for > > > newer revisions of RDT. The pseudocode would be different, but there > > > will still be a tree-like call graph (?) > > > > Thank you very much for the example. I missed in earlier examples that > > MB_HW was being controlled via MB_MAX and MB_MIN. > > I do not expect such a dependence or tree-like call graph for RDT where > > the closest equivalent (termed "optimal") is programmed independently from > > min and max. > > I hadn't realised that this RDT feature has three control thresholds. > > I'll comment in more detail on your sample info/ hierarchy, below. > > > > > > > Going back to MPAM: > > [...] > > > > So, it may make more sense to expose [MBWPBM] as a separate, bitmap schema. > > > > > > (The same goes for "Proportional stride" partitioning. It's another, > > > different, control for memory bandwidth. As of today, I don't think > > > that we have a reference platform for experimenting with either of > > > these.) > > > > Thank you. > > > > > > > > > > >>> When schemata is read, parents should always be printed before their > > >>> child schemata. But really, we just need to make sure that the > > >>> rdt_schema_all list is correctly ordered.
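The pseudocode quoted above (partly elided in this thread) can be fleshed out into a runnable toy model of the parent/child call graph. Everything below is assumption for illustration only — the 0..255 hardware range, the conversion helpers, and the use of a single MBW_MAX-style backing value are not the MPAM driver's actual code:

```python
HW_MAX = 255                  # assumed hardware control range, not MPAM's

mbw_max = {}                  # stands in for mpam->MBW_MAX[partid]

def percent_to_hw_val(pct):   # hypothetical conversion helper
    return (pct * HW_MAX + 50) // 100

def hw_val_to_percent(hw):    # round to nearest percent
    return (hw * 100 + HW_MAX // 2) // HW_MAX

def set_MB_MAX(partid, hw):   # leaf: programs the "real" register
    mbw_max[partid] = hw

def get_MB_MAX(partid):
    return mbw_max[partid]

def set_MB_HW(partid, hw):    # child of MB, parent of MB_MAX
    set_MB_MAX(partid, hw)

def get_MB_HW(partid):
    return get_MB_MAX(partid)

def set_MB(partid, pct):      # parent: percentage view
    set_MB_HW(partid, percent_to_hw_val(pct))

def get_MB(partid):           # merged/approximated read-back
    return hw_val_to_percent(get_MB_HW(partid))
```

The asymmetry described above falls out directly: `set_MB_MAX(p, get_MB_MAX(p))` leaves `get_MB(p)` unchanged, while `set_MB(p, get_MB(p))` may move `get_MB_MAX(p)`, because the percentage read-back is only an approximation of the hardware value.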
> > >>> > > >>> Do you think that this relationship needs to be reported to userspace? > > [...] > > >> We do have the info directory available to express relationships and a > > >> hierarchy is already starting to take shape there. > > > > > > I'm wondering whether using a common prefix will be future-proof? It > > > may not always be clear which part of a name counts as the common > > > prefix. > > > > Apologies for my cryptic response. I was actually musing that we already > > discussed using the info directory to express relationships between > > controls and resources and it does not seem a big leap to expand > > this to express relationships between controls. Consider something > > like below for MPAM: > > > > info > > └── MB > > └── resource_schemata > > └── MB > > └── MB_HW > > ├── MB_MAX > > └── MB_MIN > > > > > > On RDT it may then look different: > > > > info > > └── MB > > └── resource_schemata > > └── MB > > ├── MB_HW > > ├── MB_MAX > > └── MB_MIN > > > > Having the resource name as common prefix does seem consistent and makes > > clear to user space which controls apply to a resource. > > Ack. > > The above hierarchies make sense, but I wonder whether we should be > forcing software to understand the MIN and MAX limits? > > I can still see a benefit in having MB_HW be a generic, software- > defined control, even on RDT. Then, this can always be available, > with similar behaviour, on all resctrl instances that support memory > bandwidth controls. The precise set of child controls will vary per > arch (and on MPAM at least, between different hardware > implementations) -- so these look like they will work less well as a > generic interface.
> > > Considering RDT: to avoid random regulation behaviour, RDT says that > you need MIN <= OPT <= MAX, so a generic "MB_HW" control that does not > require software to understand the individual MIN, OPT and MAX > thresholds would still need to program all of these under the hood so > as to avoid an invalid combination being set in the hardware. > > If I have understood the definition of the MARC table correctly, then > there is a separate flag to report the presence of each of MIN, MAX and > OPT, so software _might_ be expected to use a random subset of them(?) > (If so, that's somewhat like the MPAM situation.) > > So, I wonder whether we could actually have the following on RDT? > > info > ├── MB > ┆ └── resource_schemata > ├── MB > ┆ └── MB_HW > ├── MB_MAX > ├── MB_MIN > └── MB_OPT > > If MB_HW is programmed by software, then MB_MAX, MB_OPT and MB_MIN > would be programmed with some reasonable default spread (or possibly, > all with the same value). > > That way, software that wants independent control over MIN, OPT and MAX > can have it (and sweat the problem of dealing with hardware where they > aren't all implemented -- if that's a thing). But software that > doesn't need this fine control gets a single MB_HW knob that is more-or- > less portable between platforms. > > Does that make sense, or is it an abstraction too far? > > > (Going one step further, maybe we can actually put MPAM and RDT > together with a 3-threshold model. For MPAM, we could possibly express > the HARDLIM option using the extra threshold... that probably needs a > bit more thought, though.) > > > > There were already discussions about appending a number to a schema > > > name in order to control different memory regions -- that's another > > > prefix/suffix relationship, if so... > > > > > > We could handle all of this by documenting all the relationships > > > explicitly.
But I'm thinking that it could be easier for maintenance > > > if the resctrl core code has explicit knowledge of the relationships. > > > > Not just for resctrl itself but to make clear to user space which > > controls impact others and which are independent. > > > That said, using a common prefix is still a good idea. But maybe we > > > shouldn't lean on it too heavily as a way of actually describing the > > > relationships? > > I do not think we can rely on order in the schemata file though. For example, > > I think MPAM's MB_HW is close enough to RDT's "optimal bandwidth" for RDT to > > also use the MB_HW name (or maybe MPAM and RDT can both use MB_OPT?) in either > > case the schemata may print something like below on both platforms (copied from > > your original example) where for MPAM it implies a relationship but for RDT it > > does not: > > > > MB: 0=50, 1=50 > > # MB_HW: 0=32, 1=32 > > # MB_MIN: 0=31, 1=31 > > # MB_MAX: 0=32, 1=32 > > This still DTRT though? If MB_HW maps into the "optimal bandwidth" > control on RDT, then it is still safe to program it first, before > MB_{MIN,MAX}. > > The contents of the schemata file won't be sufficient to figure out the > relationships, but that wasn't my intention. We have info/ for that. > > Instead, the schemata file just needs to be ordered in a way that is > compatible with those relationships, so that one line does not > unintentionally clobber the effect of a subsequent line. > > > My concern was that if we rely totally on manual maintenance to keep the > schemata file in a compatible order, we'll probably get that wrong > sooner or later... > > > >>> Since the "#" convention is for backward compatibility, maybe we should > > >>> not use this for new schemata, and place the burden of managing > > >>> conflicts onto userspace going forward. What do you think? > > >> > > >> I agree.
The way I understand this is that the '#' will only be used for > > >> new controls that shadow the default/current controls of the legacy resources. > > >> I do not expect that the prefix will be needed for new resources, even if > > >> the initial support of a new resource does not include all possible controls. > > > > > > OK. Note, relating this to the above, the # could be interpreted as > > > meaning "this is a child of some other schema; don't mess with it > > > unless you know what you are doing". > > > > Could it be made more specific to be "this is a child of a legacy schema created > > before this new format existed; don't mess with it unless you know what you are > > doing"? > > That is, any schema created after this new format is established does not need > > the '#' prefix even if there is a parent/child relationship? > > Yes, I think so. > > Except: if some schema is advertised and documented with no children, > then is it reasonable for software to assume that it will never have > children? > > I think that the answer is probably "yes", in which case would it make > sense to # any schema that is a child of some schema that did not have > children in some previous upstream kernel? > > > > > > > Older software doesn't understand the relationships, so this is just > > > there to stop it from shooting itself in the foot. > > > > ack. > > > > By extension I assume that software that understands a schema that is introduced > > after the "relationship" format is established can be expected to understand the > > format and thus these new schemata do not require the '#' prefix. Even if > > a new schema is introduced with a single control it can be followed by a new child > > control without a '#' prefix a couple of kernel releases later. By this point it > > should hopefully be understood by user space that it should not write entries it does > > not understand. > > Generally, yes. 
> > I think that boils down to: "OK, previously you could just tweak bits > of the whole schemata file you read and write the whole thing back, > and the effect would be what you inuitively expected. But in future > different schemata in the file may not be independent of one another. > We'll warn you which things might not be independent, but we may not > describe exactly how they affect each other. Changes to the schemata file are currently "staged" and then applied. There's some filesystem level error/sanity checking during the parsing phase, but maybe for MB some parts can also be delayed, and re-ordered when architecture code applies the changes. E.g. while filesystem code could check min <= opt <= max. Architecture code would be responsible to write the values to h/w in a sane manner (assuming architecture cares about transient effects when things don't conform to the ordering). E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60 Regardless of the order those requests appeared in the write(2) syscall architecture bumps max to 60, then opt to 50, and finally min to 40. > > "So, from now on, only write the things that you actually want to set." > > Does that sound about right? Users might still use their favorite editor on the schemata file and so write everything, while only changing a subset. So if we don't go for the full two-phase update I describe above this would be: "only *change* the things that you actually want to set". > [...] > > > >>> > > >>> OK -- note, I don't think we have any immediate plan to support [HARDLIM] in > > >>> the MPAM driver, but it may land eventually in some form. > > >>> > > >> > > >> ack. > > > > > > (Or, of course, anything else that achieves the same goal...) > > > > Right ... I did not dig into syntax that could be made to match existing > > schema formats etc. that can be filled in later. > > Ack > > > ... 
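The reordering rule in the min/opt/max example above generalises: at each step, pick a pending field whose new value keeps min <= opt <= max true, and commit that write. A Python sketch of the idea; `write_hw` is a made-up stand-in for whatever register write the architecture code would actually perform:

```python
def apply_thresholds(write_hw, cur, new):
    """Move (min, opt, max) from `cur` to `new` one register write at a
    time, keeping min <= opt <= max after every individual write.

    Assumes both `cur` and `new` already satisfy the invariant.
    """
    state = dict(cur)
    while state != new:
        progressed = False
        for field in ("min", "opt", "max"):
            if state[field] == new[field]:
                continue
            trial = dict(state, **{field: new[field]})
            if trial["min"] <= trial["opt"] <= trial["max"]:
                write_hw(field, new[field])   # safe to commit this one
                state = trial
                progressed = True
        assert progressed, "no single safe write exists"
    return state
```

For 10,20,30 -> 40,50,60 this emits the writes in the order max, opt, min, matching the example above; for a decrease it naturally emits min, opt, max. Each field is written at most once, since a committed write already holds its final value.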
> > > > >>> I'll try to pull the state of this discussion together -- maybe as a > > >>> draft update to the documentation, describing the interface as proposed > > >>> so far. Does that work for you? > > >> > > >> It does. Thank you very much for taking this on. > > >> > > >> Reinette > > > > > > OK, I'll aim to follow up on this next week. > > > > Thank you very much. > > > > Reinette > > Cheers > ---Dave -Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-20 16:31 ` Luck, Tony @ 2025-10-21 14:37 ` Dave Martin 2025-10-21 20:59 ` Luck, Tony 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-10-21 14:37 UTC (permalink / raw) To: Luck, Tony Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Tony, On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote: > On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote: > > Hi Reinette, > > > > On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote: [...] > > > By extension I assume that software that understands a schema that is introduced > > > after the "relationship" format is established can be expected to understand the > > > format and thus these new schemata do not require the '#' prefix. Even if > > > a new schema is introduced with a single control it can be followed by a new child > > > control without a '#' prefix a couple of kernel releases later. By this point it > > > should hopefully be understood by user space that it should not write entries it does > > > not understand. > > > > Generally, yes. > > > > I think that boils down to: "OK, previously you could just tweak bits > > of the whole schemata file you read and write the whole thing back, > > and the effect would be what you inuitively expected. But in future > > different schemata in the file may not be independent of one another. > > We'll warn you which things might not be independent, but we may not > > describe exactly how they affect each other. > > Changes to the schemata file are currently "staged" and then applied. > There's some filesystem level error/sanity checking during the parsing > phase, but maybe for MB some parts can also be delayed, and re-ordered > when architecture code applies the changes. > > E.g. while filesystem code could check min <= opt <= max. 
> > Architecture > code would be responsible to write the values to h/w in a sane manner > (assuming architecture cares about transient effects when things don't > conform to the ordering). > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60 > Regardless of the order those requests appeared in the write(2) syscall > architecture bumps max to 60, then opt to 50, and finally min to 40. This could indeed be sorted out during staging, but I'm not sure that we can/should rely on it. If we treat the data coming from a single write() as a transaction, and stage the whole thing before executing it, that's fine. But I think this has to be viewed as an optimisation rather than guaranteed semantics. We told userspace that schemata is an S_IFREG regular file, so we have to accept a write() boundary anywhere in the stream. (In fact, resctrl chokes if a write boundary occurs in the middle of a line. In practice, stdio buffering and similar means that this issue turns out to be difficult to hit, except with shell scripts that try to emit a line piecemeal -- I have a partial fix for that knocking around, but this throws up other problems, so I gave up for the time being.) We also cannot currently rely on userspace closing the fd between "transactions". We never told userspace to do that, previously. We could make a new requirement, but it feels unexpected/unreasonable (?) > > > > "So, from now on, only write the things that you actually want to set." > > > > Does that sound about right? > > Users might still use their favorite editor on the schemata file and > so write everything, while only changing a subset. So if we don't go > for the full two-phase update I describe above this would be: > > "only *change* the things that you actually want to set". [...]
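For what it's worth, the "partial fix" alluded to above amounts to buffering: accumulate bytes across write() calls and only act once a complete, newline-terminated command has arrived. A user-space flavoured sketch of that idea (this is not resctrl's actual implementation, and holding an unbounded tail in the kernel brings its own problems, which may be exactly what made the fix awkward):

```python
class SchemataWriter:
    """Accumulate arbitrary write() fragments into whole command lines."""

    def __init__(self, apply_command):
        self.tail = ""                     # bytes seen after the last newline
        self.apply_command = apply_command # callback for one complete line

    def write(self, data):
        self.tail += data
        while "\n" in self.tail:
            line, self.tail = self.tail.split("\n", 1)
            if line.strip():
                self.apply_command(line)   # act only on complete commands
        return len(data)
```

With this, a shell script that emits `MB:0=5` and `0\n` in two separate writes would still be parsed as the single command `MB:0=50`, instead of two truncated ones.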
> -Tony This works if the schemata file is output in the right order (and the user doesn't change the order): # cat schemata MB:0=100;1=100 # MB_HW:0=1024;1=1024 -> # cat <<EOF >schemata MB:0=100;1=100 MB_HW:0=512,1=512 EOF ... though it may still be inefficient, if the lines are not staged together. The hardware memory bandwidth controls may get programmed twice, here -- though the final result is probably what was intended. I'd still prefer that we tell people that they should be doing this: # cat <<EOF >schemata MB_HW:0=512,1=512 EOF ...if they are really trying to set MB_HW and don't care about the effect on MB? Cheers ---Dave
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-21 14:37 ` Dave Martin @ 2025-10-21 20:59 ` Luck, Tony 2025-10-22 14:58 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Luck, Tony @ 2025-10-21 20:59 UTC (permalink / raw) To: Dave Martin Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote: > Hi Tony, > > On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote: > > On Mon, Oct 20, 2025 at 04:50:38PM +0100, Dave Martin wrote: > > > Hi Reinette, > > > > > > On Fri, Oct 17, 2025 at 08:59:45AM -0700, Reinette Chatre wrote: > > [...] > > > > > By extension I assume that software that understands a schema that is introduced > > > > after the "relationship" format is established can be expected to understand the > > > > format and thus these new schemata do not require the '#' prefix. Even if > > > > a new schema is introduced with a single control it can be followed by a new child > > > > control without a '#' prefix a couple of kernel releases later. By this point it > > > > should hopefully be understood by user space that it should not write entries it does > > > > not understand. > > > > > > Generally, yes. > > > > > > I think that boils down to: "OK, previously you could just tweak bits > > > of the whole schemata file you read and write the whole thing back, > > > and the effect would be what you inuitively expected. But in future > > > different schemata in the file may not be independent of one another. > > > We'll warn you which things might not be independent, but we may not > > > describe exactly how they affect each other. > > > > Changes to the schemata file are currently "staged" and then applied. 
> > There's some filesystem level error/sanity checking during the parsing > > phase, but maybe for MB some parts can also be delayed, and re-ordered > > when architecture code applies the changes. > > > > E.g. while filesystem code could check min <= opt <= max. Architecture > > code would be responsible to write the values to h/w in a sane manner > > (assuming architecture cares about transient effects when things don't > > conform to the ordering). > > > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60 > > Regardless of the order those requests appeared in the write(2) syscall > > architecture bumps max to 60, then opt to 50, and finally min to 40. > > This could indeed be sorted out during staging, but I'm not > sure that we can/should rely on it. > > If we treat the data coming from a single write() as a transaction, and > stage the whole thing before executing it, that's fine. But I think > this has to be viewed as an optimisation rather than guaranteed > semantics. > > > We told userspace that schemata is an S_IFREG regular file, so we have > to accept a write() boundary anywhere in the stream. > > (In fact, resctrl chokes if a write boundary occurs in the middle of a > line. In practice, stdio buffering and similar means that this issue > turns out to be difficult to hit, except with shell scripts that try to > emit a line piecemeal -- I have a partial fix for that knocking around, > but this throws up other problems, so I gave up for the time being.) Is this worth the pain and complexity? Maybe just document the reality of the implementation since day 1 of resctrl that each write(2) must contain one or more lines, each terminated with "\n". There are already so many ways that the schemata file does not behave like a regular S_IFREG file. E.g. accepting a write to just update one domain in a resource: # echo L3:2=0xff > schemata So describe schemata in terms of writing "update commands" rather than "lines"?
> > We also cannot currently rely on userspace closing the fd between > "transactions". We never told userspace to do that, previously. We > could make a new requirement, but it feels unexpected/unreasonable (?) > > > > > > > "So, from now on, only write the things that you actually want to set." > > > > > > Does that sound about right? > > > > Users might still use their favorite editor on the schemata file and > > so write everything, while only changing a subset. So if we don't go > > for the full two-phase update I describe above this would be: > > > > "only *change* the things that you actually want to set". I misremembered where the check for "did the user change the value" happened. I thought it was during parsing, but it is actually in resctrl_arch_update_domains() after all input parsing is complete and resctrl is applying changes. So unless we change things to work the way I hallucinated, then ordering does matter the way you described. > > [...] > > > -Tony > > This works if the schemata file is output in the right order (and the > user doesn't change the order): > > # cat schemata > MB:0=100;1=100 > # MB_HW:0=1024;1=1024 > > -> > > # cat <<EOF >schemata > MB:0=100;1=100 > MB_HW:0=512,1=512 > EOF > > ... though it may still be inefficient, if the lines are not staged > together. The hardware memory bandwidth controls may get programmed > twice, here -- though the final result is probably what was intended. > > I'd still prefer that we tell people that they should be doing this: > # cat <<EOF >schemata > MB_HW:0=512,1=512 > EOF > > ...if they are really tyring to set MB_HW and don't care about the > effect on MB? I'm starting to worry about this co-existence of old/new syntax for Intel region aware. Life seems simple if there is only one MB_HW connected to the legacy "MB". Updates to either will make both appear with new values when the schemata is read. E.g. 
# cat schemata MB:0=100 #MB_HW=255 # echo MB:0=50 > schemata # cat schemata MB:0=50 #MB_HW=127 But Intel will have several MB_HW controls, one for each region. [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here] # cat schemata MB:0=100 #MB_HW0=255 #MB_HW1=255 #MB_HW2=255 #MB_HW3=255 If the user sets just one of the HW controls: # echo MB_HW1=64 what should resctrl display for the legacy "MB:" line? > > Cheers > ---Dave -Tony
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-21 20:59 ` Luck, Tony @ 2025-10-22 14:58 ` Dave Martin 2025-10-22 16:21 ` Luck, Tony 0 siblings, 1 reply; 52+ messages in thread From: Dave Martin @ 2025-10-22 14:58 UTC (permalink / raw) To: Luck, Tony Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Tony, On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote: > Hi Dave, > > On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote: > > Hi Tony, > > > > On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote: [...] > > > Changes to the schemata file are currently "staged" and then applied. > > > There's some filesystem level error/sanity checking during the parsing > > > phase, but maybe for MB some parts can also be delayed, and re-ordered > > > when architecture code applies the changes. > > > > > > E.g. while filesystem code could check min <= opt <= max. Architecture > > > code would be responsible to write the values to h/w in a sane manner > > > (assuming architecture cares about transient effects when things don't > > > conform to the ordering). > > > > > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60 > > > Regardless of the order those requests appeared in the write(2) syscall > > > architecture bumps max to 60, then opt to 50, and finally min to 40. > > > > This could be sorted indeed be sorted out during staging, but I'm not > > sure that we can/should rely on it. > > > > If we treat the data coming from a single write() as a transaction, and > > stage the whole thing before executing it, that's fine. But I think > > this has to be viewed as an optimisation rather than guaranteed > > semantics. > > > > > > We told userspace that schemata is an S_IFREG regular file, so we have > > to accept a write() boundary anywhere in the stream. 
> > > > (In fact, resctrl chokes if a write boundary occurs in the middle of a > > line. In practice, stdio buffering and similar means that this issue > > turns out to be difficult to hit, except with shell scripts that try to > > emit a line piecemeal -- I have a partial fix for that knocking around, > > but this throws up other problems, so I gave up for the time being.) > > Is this worth the pain and complexity? Maybe just document the reality > of the implementation since day 1 of resctrl that each write(2) must > contain one or more lines, each terminated with "\n". <soapbox> We could, in the same way that a vendor could wire a UART directly to the pins of a regular mains power plug. They could stick a big label on it saying exactly how the pins should be hooked up to another low-voltage UART and not plugged into a mains power outlet... but you know what's going to happen. The whole point of a file-like interface is that the user doesn't (or shouldn't) have to craft I/O directly at the syscall level. If they have to do that, then the reasons for not relying on ioctl() or a binary protocol melt away (like that UART). Because the easy, unsafe way of working with these files almost always works, people are almost certainly going to use it, even if we tell them not to (IMHO). </soapbox> That said, for practical purposes, the interface is reliable enough (for now). We probably shouldn't mess with it unless we can come up with something that is clearly better. (I have some ideas, but I think it's off-topic, here.) > There are already so many ways that the schemata file does not behave > like a regular S_IFREG file. E.g. accepting a write to just update > one domain in a resource: # echo L3:2=0xff > schemata That still feels basically file-like. I can write something into a file, then something else can read what I wrote, interpret it in any way it likes, and write back something different for me to read.
In our case, it is as if after each write() the kernel magically reads and rewrites the file before userspace gets a chance to do anything else. This doesn't work as a protocol between userspace processes, but the kernel can pull tricks that are not available to userspace -- so it can be made to work for user <-> kernel protocols (modulo the issues about write() boundaries etc.) > So describe schemata in terms of writing "update commands" rather > than "Lines"? That's reasonable. In practice, each line written is a request to the kernel to do something, but it's already the case that the kernel doesn't necessarily do exactly what was asked for (due to rounding, etc.) Overall, I think the current state of play is that we need to consider the lines to be independent "commands", and execute them in the order given. That's the model I've been assuming here. > > We also cannot currently rely on userspace closing the fd between > > "transactions". We never told userspace to do that, previously. We > > could make a new requirement, but it feels unexpected/unreasonable (?) > > > > > > > > > > "So, from now on, only write the things that you actually want to set." > > > > > > > > Does that sound about right? > > > > > > Users might still use their favorite editor on the schemata file and > > > so write everything, while only changing a subset. So if we don't go > > > for the full two-phase update I describe above this would be: > > > > > > "only *change* the things that you actually want to set". > > I misremembered where the check for "did the user change the value" > happened. I thought it was during parsing, but it is actually in > resctrl_arch_update_domains() after all input parsing is complete > and resctrl is applying changes. So unless we change things to work > the way I hallucinated, then ordering does matter the way you > described. Ah, right. There would be different ways to do this, but yes, that was my understanding of how things work today. > > > > [...] 
> > > > > -Tony > > > > This works if the schemata file is output in the right order (and the > > user doesn't change the order): > > > > # cat schemata > > MB:0=100;1=100 > > # MB_HW:0=1024;1=1024 > > > > -> > > > > # cat <<EOF >schemata > > MB:0=100;1=100 > > MB_HW:0=512,1=512 > > EOF > > > > ... though it may still be inefficient, if the lines are not staged > > together. The hardware memory bandwidth controls may get programmed > > twice, here -- though the final result is probably what was intended. > > > > I'd still prefer that we tell people that they should be doing this: > > # cat <<EOF >schemata > > MB_HW:0=512,1=512 > > EOF > > > > ...if they are really trying to set MB_HW and don't care about the > > effect on MB? > > I'm starting to worry about this co-existence of old/new syntax for > Intel region aware. Life seems simple if there is only one MB_HW > connected to the legacy "MB". Updates to either will make both > appear with new values when the schemata is read. E.g. > > # cat schemata > MB:0=100 > #MB_HW=255 > > # echo MB:0=50 > schemata > > # cat schemata > MB:0=50 > #MB_HW=127 > > But Intel will have several MB_HW controls, one for each region. > [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here] > > # cat schemata > MB:0=100 > #MB_HW0=255 > #MB_HW1=255 > #MB_HW2=255 > #MB_HW3=255 > > If the user sets just one of the HW controls: > > # echo MB_HW1=64 > > what should resctrl display for the legacy "MB:" line? > > -Tony Erm, good question. I hadn't thought too carefully about the region-aware case. I think it's reasonable to expect software that writes MB_HW<n> independently to pay attention only to these specific schemata when reading back -- a bit like accessing a C union.
# echo 'MB:0=100' >schemata # cat schemata -> MB:0=100 # MB_HW:0=255 # MB_HW0:0=255 # MB_HW1:0=255 # MB_HW2:0=255 # MB_HW3:0=255 # echo 'MB:0=100' >schemata # cat schemata -> MB:0=50 # MB_HW:0=128 # MB_HW0:0=128 # MB_HW1:0=128 # MB_HW2:0=128 # MB_HW3:0=128 # echo 'MB_HW:0=127' >schemata # cat schemata -> MB:0=50 # MB_HW:0=127 # MB_HW0:0=127 # MB_HW1:0=127 # MB_HW2:0=127 # MB_HW3:0=127 # echo 'MB_HW1:0=64' >schemata # cat schemata -> MB:0=??? # MB_HW:0=??? # MB_HW0:0=127 # MB_HW1:0=64 # MB_HW2:0=127 # MB_HW3:0=127 The rules for populating the ??? entries could be designed to be somewhat intuitive, or we could just do the easiest thing. So, could we just pick one, fixed, region to read the MB_HW value from? Say, MB_HW0: MB:0=50 # MB_HW:0=127 # MB_HW0:0=127 # MB_HW1:0=64 # MB_HW2:0=127 # MB_HW3:0=127 Or take the average across all regions: MB:0=44 # MB_HW:0=111 # MB_HW0:0=127 # MB_HW1:0=64 # MB_HW2:0=127 # MB_HW3:0=127 The latter may be more costly or complex to implement, and I don't know whether it is really useful. Software that knows about the MB_HW<n> entries also knows that once you have looked at these, MB_HW and MB tell you nothing else. What do you think? I'm wondering whether setting the MB_HW<n> independently may be quite a specialised use case, which not everyone will want/need to do, but that's an assumption on my part. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
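The two readback policies floated above can be made concrete with a small sketch. Everything here is illustrative only: the 255-step hardware scale and the round-up behaviour are taken from the examples in this thread, the function names are invented, and the real per-region schemata names are still TBD -- none of this is an existing kernel interface.

```python
# Sketch of the two candidate policies for deriving the legacy "MB"
# percentage from per-region MB_HW values (hypothetical helpers).

HW_SCALE = 255  # hardware control value corresponding to 100% bandwidth

def hw_to_percent(hw: int) -> int:
    """Convert a hardware value to a percentage, rounding up so that
    the reported percentage never understates the allocation."""
    return -(-hw * 100 // HW_SCALE)  # ceiling division

def mb_fixed_region(regions: list[int]) -> int:
    """Policy 1: derive the MB line from one fixed region (region 0)."""
    return hw_to_percent(regions[0])

def mb_average(regions: list[int]) -> int:
    """Policy 2: derive the MB line from the average across regions."""
    return hw_to_percent(sum(regions) // len(regions))

# Example from the thread: MB_HW1 written to 64, other regions at 127.
regions = [127, 64, 127, 127]
```

With these inputs, the fixed-region policy reproduces the MB:0=50 line and the averaging policy the MB:0=44 line shown in the examples above.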
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-22 14:58 ` Dave Martin @ 2025-10-22 16:21 ` Luck, Tony 2025-10-23 14:04 ` Dave Martin 0 siblings, 1 reply; 52+ messages in thread From: Luck, Tony @ 2025-10-22 16:21 UTC (permalink / raw) To: Dave Martin Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, On Wed, Oct 22, 2025 at 03:58:08PM +0100, Dave Martin wrote: > Hi Tony, > > On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote: > > Hi Dave, > > > > On Tue, Oct 21, 2025 at 03:37:35PM +0100, Dave Martin wrote: > > > Hi Tony, > > > > > > On Mon, Oct 20, 2025 at 09:31:18AM -0700, Luck, Tony wrote: > > [...] > > > > > Changes to the schemata file are currently "staged" and then applied. > > > > There's some filesystem level error/sanity checking during the parsing > > > > phase, but maybe for MB some parts can also be delayed, and re-ordered > > > > when architecture code applies the changes. > > > > > > > > E.g. while filesystem code could check min <= opt <= max. Architecture > > > > code would be responsible to write the values to h/w in a sane manner > > > > (assuming architecture cares about transient effects when things don't > > > > conform to the ordering). > > > > > > > > E.g. User requests moving from min,opt,max = 10,20,30 to 40,50,60 > > > > Regardless of the order those requests appeared in the write(2) syscall > > > > architecture bumps max to 60, then opt to 50, and finally min to 40. > > > > > > This could be sorted indeed be sorted out during staging, but I'm not > > > sure that we can/should rely on it. > > > > > > If we treat the data coming from a single write() as a transaction, and > > > stage the whole thing before executing it, that's fine. But I think > > > this has to be viewed as an optimisation rather than guaranteed > > > semantics. 
> > > > > > > > We told userspace that schemata is an S_IFREG regular file, so we have > > > to accept a write() boundary anywhere in the stream. > > > > > > (In fact, resctrl chokes if a write boundary occurs in the middle of a > > > line. In practice, stdio buffering and similar means that this issue > > > turns out to be difficult to hit, except with shell scripts that try to > > > emit a line piecemeal -- I have a partial fix for that knocking around, > > > but this throws up other problems, so I gave up for the time being.) > > > > Is this worth the pain and complexity? Maybe just document the reality > > of the implementation since day 1 of resctrl that each write(2) must > > contain one or more lines, each terminated with "\n". > > <soapbox> > > We could, in the same way that a vendor could wire a UART directly to > the pins of a regular mains power plug. They could stick a big label > on it saying exactly how the pins should be hooked up to another low- > voltage UART and not plugged into a mains power outlet... but you know > what's going to happen. The PDP 11/03 for undergraduate Comp Sci student use at my university had allegedly been student-proofed against such things. Oral history said you could wire 240V mains across input pins to get a 50 Hz clock. I didn't test this theory. > The whole point of a file-like interface is that the user doesn't (or > shouldn't) have to craft I/O directly at the syscall level. If they > have to do that, then the reasons for not relying on ioctl() or a > binary protocol melt away (like that UART). > > Because the easy, unsafe way of working with these files almost always > works, people are almost certainly going to use it, even if we tell > them not to (IMHO). > > </soapbox> > > > That said, for practical purposes, the interface is reliable enough > (for now). We probably shouldn't mess with it unless we can come up > with something that is clearly better. > > (I have some ideas, but I think it's off-topic, here.)
Agreed off-topic ... but fixing it seems hard. What if I do: # echo -n "L3:0=" > schemata and then my control program dies? > > There are already so many ways that the schemata file does not behave > > like a regular S_IFREG file. E.g. accepting a write to just update > > one domain in a resource: # echo L3:2=0xff > schemata > > That still feels basically file-like. I can write something into a > file, then something else can read what I wrote, interpret it in any > way it likes, and write back something different for me to read. > > In our case, it is as if after each write() the kernel magically reads > and rewrites the file before userspace gets a chance to do anything > else. This doesn't work as a protocol between userspace processes, but > the kernel can pull tricks that are not available to userspace -- so it > can be made to work for user <-> kernel protocols (modulo the issues > about write() boundaries etc.) > > > So describe schemata in terms of writing "update commands" rather > > than "Lines"? > > That's reasonable. In practice, each line written is a request to the > kernel to do something, but it's already the case that the kernel > doesn't necessarily do exactly what was asked for (due to rounding, > etc.) > > > Overall, I think the current state of play is that we need to consider > the lines to be independent "commands", and execute them in the order > given. > > That's the model I've been assuming here. > > > > > We also cannot currently rely on userspace closing the fd between > > > "transactions". We never told userspace to do that, previously. We > > > could make a new requirement, but it feels unexpected/unreasonable (?) > > > > > > > > > > > > > "So, from now on, only write the things that you actually want to set." > > > > > > > > > > Does that sound about right? > > > > > > > > Users might still use their favorite editor on the schemata file and > > > > so write everything, while only changing a subset. 
So if we don't go > > > > for the full two-phase update I describe above this would be: > > > > > > > > "only *change* the things that you actually want to set". > > > > I misremembered where the check for "did the user change the value" > > happened. I thought it was during parsing, but it is actually in > > resctrl_arch_update_domains() after all input parsing is complete > > and resctrl is applying changes. So unless we change things to work > > the way I hallucinated, then ordering does matter the way you > > described. > > Ah, right. > > There would be different ways to do this, but yes, that was my > understanding of how things work today. > > > > > > > [...] > > > > > > > -Tony > > > > > > This works if the schemata file is output in the right order (and the > > > user doesn't change the order): > > > > > > # cat schemata > > > MB:0=100;1=100 > > > # MB_HW:0=1024;1=1024 > > > > > > -> > > > > > > # cat <<EOF >schemata > > > MB:0=100;1=100 > > > MB_HW:0=512,1=512 > > > EOF > > > > > > ... though it may still be inefficient, if the lines are not staged > > > together. The hardware memory bandwidth controls may get programmed > > > twice, here -- though the final result is probably what was intended. > > > > > > I'd still prefer that we tell people that they should be doing this: > > > # cat <<EOF >schemata > > > MB_HW:0=512,1=512 > > > EOF > > > > > > ...if they are really tyring to set MB_HW and don't care about the > > > effect on MB? > > > > I'm starting to worry about this co-existence of old/new syntax for > > Intel region aware. Life seems simple if there is only one MB_HW > > connected to the legacy "MB". Updates to either will make both > > appear with new values when the schemata is read. E.g. > > > > # cat schemata > > MB:0=100 > > #MB_HW=255 > > > > # echo MB:0=50 > schemata > > > > # cat schemata > > MB:0=50 > > #MB_HW=127 > > > > But Intel will have several MB_HW controls, one for each region. 
> > [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here] > > > > # cat schemata > > MB:0=100 > > #MB_HW0=255 > > #MB_HW1=255 > > #MB_HW2=255 > > #MB_HW3=255 > > > > If the user sets just one of the HW controls: > > > > # echo MB_HW1=64 > > > > what should resctrl display for the legacy "MB:" line? > > > > -Tony > > Erm, good question. I hadn't though too carefully about the region- > aware case. > > I think it's reasonable to expect software that writes MB_HW<n> > independently to pay attention only to these specific schemata when > reading back -- a bit like accessing a C union. > > # echo 'MB:0=100' >schemata > # cat schemata > -> > MB:0=100 > # MB_HW:0=255 > # MB_HW0:0=255 > # MB_HW1:0=255 > # MB_HW2:0=255 > # MB_HW3:0=255 > > # echo 'MB:0=100' >schemata > # cat schemata > -> > MB:0=50 > # MB_HW:0=128 > # MB_HW0:0=128 > # MB_HW1:0=128 > # MB_HW2:0=128 > # MB_HW3:0=128 > > # echo 'MB_HW:0=127' >schemata > # cat schemata > -> > MB:0=50 > # MB_HW:0=127 > # MB_HW0:0=127 > # MB_HW1:0=127 > # MB_HW2:0=127 > # MB_HW3:0=127 > > # echo 'MB_HW1:0=64' >schemata > # cat schemata > -> > MB:0=??? > # MB_HW:0=??? > # MB_HW0:0=127 > # MB_HW1:0=64 > # MB_HW2:0=127 > # MB_HW3:0=127 > > The rules for populating the ??? entries could be designed to be > somewhat intuitive, or we could just do the easiest thing. > > So, could we just pick one, fixed, region to read the MB_HW value from? > Say, MB_HW0: > > MB:0=50 > # MB_HW:0=127 > # MB_HW0:0=127 > # MB_HW1:0=64 > # MB_HW2:0=127 > # MB_HW3:0=127 > > Or take the average across all regions: > > MB:0=44 > # MB_HW:0=111 > # MB_HW0:0=127 > # MB_HW1:0=64 > # MB_HW2:0=127 > # MB_HW3:0=127 > > The latter may be more costly or complex to implement, and I don't > know whether it is really useful. Software that knows about the > MB_HW<n> entries also knows that once you have looked at these, MB_HW > and MB tell you nothing else. > > What do you think? 
> > I'm wondering whether setting the MB_HW<n> independently may be quite a > specialised use case, which not everyone will want/need to do, but > that's an assumption on my part. It's difficult to guess what users will want to do. But it is likely the case that total available bandwidth to regions will be different (local DDR > remote DDR > CXL). So while the system will boot up with no throttling on any region, it may be useful to enforce more throttling on access to the slower regions. Rather than trying to make up some number to fill in the ?? for the MB: line, another option would be to stop showing the legacy MB: line in schemata as soon as the user shows they know about the direct HW access mode by writing any of the HW lines. Any sysadmin trying to mix and match legacy access with direct HW access is going to run into problems very quickly. In the spirit of not giving them the cable to connect mains to the UART, perhaps removing the foot-gun from the table might be a good option? > Cheers > ---Dave -Tony ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-10-22 16:21 ` Luck, Tony @ 2025-10-23 14:04 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-10-23 14:04 UTC (permalink / raw) To: Luck, Tony Cc: Reinette Chatre, linux-kernel, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Tony, On Wed, Oct 22, 2025 at 09:21:03AM -0700, Luck, Tony wrote: > Hi Dave, > > On Wed, Oct 22, 2025 at 03:58:08PM +0100, Dave Martin wrote: > > Hi Tony, > > > > On Tue, Oct 21, 2025 at 01:59:36PM -0700, Luck, Tony wrote: [...] > > <soapbox> > > > > We could, in the same way that a vendor could wire a UART directly to > > the pins of a regular mains power plug. They could stick a big label > > on it saying exactly how the pins should be hooked up to another low- > > voltage UART and not plugged into a mains power outlet... but you know > > what's going to happen. > > The PDP 11/03 for undegraduate Comp Sci student use at my univeristy had allegedly > been student proofed against such things. Oral history said you could wire 240V > mains across input pins to get a 50 Hz clock. I didn't test this theory. Now, there's an idea... > > The whole point of a file-like interface is that the user doesn't (or > > shouldn't) have to craft I/O directly at the syscall level. If they > > have to do that, then the reasons for not relying on ioctl() or a > > binary protocol melt away (like that UART). > > > > Because the easy, unsafe way of working with these files almost always > > works, people are almost certainly going to use it, even if we tell > > them not to (IMHO). > > > > </soapbox> > > > > > > That said, for practical purposes, the interface is reliable enough > > (for now). We probably shouldn't mess with it unless we can come up > > with something that is clearly better. > > > > (I have some ideas, but I think it's off-topic, here.) > > Agreed off-topic ... 
but fixing it seems hard. What if I do: > > # echo -n "L3:0=" > schemata > > and then my control program dies? Probably nothing? In my hack for this, I buffered a partial line for each open struct file. If the struct file survives the terminated program, something else could append more to the incomplete line through any fd still open on the struct file (as in my { { echo; ... echo; } >schemata; } shell example). Otherwise, when the file is closed with an incomplete line, an error could be reported through close(). I implemented this, but it turns out not to be a magic bullet -- lots of software doesn't check the return value from close() / fclose(), and Linux's version of dup2() just silently loses close-time errors on the fd being clobbered. (dash, and probably other shells, undo redirections using dup2(). Dupping the victim fd before the dup2(), so that it can be closed separately, can help -- as documented in the dup2() man page. But as of today, most software probably doesn't do this. Some OSes seem to have different dup2() behaviour that doesn't suffer from this problem.) Anyway, all in all, I wasn't convinced that this approach created fewer problems than it solved... [...] > > > I'm starting to worry about this co-existence of old/new syntax for > > > Intel region aware. Life seems simple if there is only one MB_HW > > > connected to the legacy "MB". Updates to either will make both > > > appear with new values when the schemata is read. E.g. > > > > > > # cat schemata > > > MB:0=100 > > > #MB_HW=255 > > > > > > # echo MB:0=50 > schemata > > > > > > # cat schemata > > > MB:0=50 > > > #MB_HW=127 > > > > > > But Intel will have several MB_HW controls, one for each region.
> > > [Schemata names TBD, but I'll just call them 0, 1, 2, 3 here] > > > > > > # cat schemata > > > MB:0=100 > > > #MB_HW0=255 > > > #MB_HW1=255 > > > #MB_HW2=255 > > > #MB_HW3=255 > > > > > > If the user sets just one of the HW controls: > > > > > > # echo MB_HW1=64 > > > > > > what should resctrl display for the legacy "MB:" line? > > > > > > -Tony > > > > Erm, good question. I hadn't though too carefully about the region- > > aware case. > > > > I think it's reasonable to expect software that writes MB_HW<n> > > independently to pay attention only to these specific schemata when > > reading back -- a bit like accessing a C union. > > > > # echo 'MB:0=100' >schemata > > # cat schemata > > -> > > MB:0=100 > > # MB_HW:0=255 > > # MB_HW0:0=255 > > # MB_HW1:0=255 > > # MB_HW2:0=255 > > # MB_HW3:0=255 > > > > # echo 'MB:0=100' >schemata > > # cat schemata > > -> > > MB:0=50 > > # MB_HW:0=128 > > # MB_HW0:0=128 > > # MB_HW1:0=128 > > # MB_HW2:0=128 > > # MB_HW3:0=128 > > > > # echo 'MB_HW:0=127' >schemata > > # cat schemata > > -> > > MB:0=50 > > # MB_HW:0=127 > > # MB_HW0:0=127 > > # MB_HW1:0=127 > > # MB_HW2:0=127 > > # MB_HW3:0=127 > > > > # echo 'MB_HW1:0=64' >schemata > > # cat schemata > > -> > > MB:0=??? > > # MB_HW:0=??? > > # MB_HW0:0=127 > > # MB_HW1:0=64 > > # MB_HW2:0=127 > > # MB_HW3:0=127 > > > > The rules for populating the ??? entries could be designed to be > > somewhat intuitive, or we could just do the easiest thing. > > > > So, could we just pick one, fixed, region to read the MB_HW value from? > > Say, MB_HW0: > > > > MB:0=50 > > # MB_HW:0=127 > > # MB_HW0:0=127 > > # MB_HW1:0=64 > > # MB_HW2:0=127 > > # MB_HW3:0=127 > > > > Or take the average across all regions: > > > > MB:0=44 > > # MB_HW:0=111 > > # MB_HW0:0=127 > > # MB_HW1:0=64 > > # MB_HW2:0=127 > > # MB_HW3:0=127 > > > > The latter may be more costly or complex to implement, and I don't > > know whether it is really useful. 
Software that knows about the > > MB_HW<n> entries also knows that once you have looked at these, MB_HW > > and MB tell you nothing else. > > > > What do you think? > > > > I'm wondering whether setting the MB_HW<n> independently may be quite a > > specialised use case, which not everyone will want/need to do, but > > that's an assumption on my part. > > It's difficult to guess what users will want to do. But it is likely > the case that total available bandwidth to regions will be different > (local DDR > remote DDR > CXL). So while the system will boot up with > no throttling on any region, it may be useful to enforce more throttling > on access to the slower regions. > > Rather than trying to make up some number to fill in the ?? for the MB: > line, another option would be to stop showing the legacy MB: line in schemata > as soon as the user shows they know about the direct HW access mode > by writing any of the HW lines. > > Any sysadmin trying to mix and match legacy access with direct HW access > is going to run into problems very quickly. In the spirit of not giving > them the cable to connect mains to the UART, perhaps removing the > foot-gun from the table might be a good option? > > -Tony Quite possibly. Ideally, we'd have some kind of generic interface, but (as with "MB") there's always the risk that the hardware evolves in directions that don't fit the abstraction. For now, I will try to refocus the discussion back onto the schema description topic. I think that's probably the easiest thing to get nailed down before we try to figure out how to deal with the "shadow schema" issue. Cheers ---Dave ^ permalink raw reply [flat|nested] 52+ messages in thread
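The failure mode discussed in this sub-thread -- errors silently lost because software never checks the result of close() -- is easy to hit from high-level languages too. As a rough illustration only (resctrl does not require any of this today, and the path handling is a placeholder), a defensive writer would emit whole, newline-terminated lines in a single write() and let a close-time error propagate:

```python
import os

def write_schemata(path: str, lines: list[str]) -> None:
    """Write complete, newline-terminated lines in one write() call,
    so no write() boundary ever falls mid-line, and propagate any
    error that is only reported when the fd is closed."""
    data = "".join(line.rstrip("\n") + "\n" for line in lines).encode()
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, data)  # a single write(): no mid-line boundary
    finally:
        os.close(fd)  # an OSError raised here is a close-time failure
```

This sidesteps both problems at once: the kernel never sees a partial line, and a deferred error surfaces as an exception rather than vanishing into an unchecked return value.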
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 13:56 ` Dave Martin 2025-09-29 16:09 ` Reinette Chatre @ 2025-09-29 16:37 ` Luck, Tony 2025-09-30 16:02 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Luck, Tony @ 2025-09-29 16:37 UTC (permalink / raw) To: Dave Martin Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote: > Hi Tony, > > Thanks for taking at look at this -- comments below. > > [...] > > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: > > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: > > [...] > > > > Would something like the following work? A read from schemata might > > > produce something like this: > > > > > > MB: 0=50, 1=50 > > > # MB_HW: 0=32, 1=32 > > > # MB_MIN: 0=31, 1=31 > > > # MB_MAX: 0=32, 1=32 > > [...] > > > > I'd be interested in people's thoughts on it, though. > > > > Applying this to Intel upcoming region aware memory bandwidth > > that supports 255 steps and h/w min/max limits. > > Following the MPAM example, would you also expect: > > scale: 255 > unit: 100pc > > ...? Yes. 255 (or whatever "Q" value is provided in the ACPI table) corresponds to no throttling, so 100% bandwidth. > > > We would have info files with "min = 1, max = 255" and a schemata > > file that looks like this to legacy apps: > > > > MB: 0=50;1=75 > > #MB_HW: 0=128;1=191 > > #MB_MIN: 0=128;1=191 > > #MB_MAX: 0=128;1=191 > > > > But a newer app that is aware of the extensions can write: > > > > # cat > schemata << 'EOF' > > MB_HW: 0=10 > > MB_MIN: 0=10 > > MB_MAX: 0=64 > > EOF > > > > which then reads back as: > > MB: 0=4;1=75 > > #MB_HW: 0=10;1=191 > > #MB_MIN: 0=10;1=191 > > #MB_MAX: 0=64;1=191 > > > > with the legacy line updated with the rounded value of the MB_HW > > supplied by the user. 10/255 = 3.921% ... so call it "4". 
> > I'm suggesting that this always be rounded up, so that you have a > guarantee that the steps are no smaller than the reported value. Round up, rather than round-to-nearest, makes sense. Though perhaps only cosmetic as I would be surprised if anyone has a mix of tools looking at the legacy schemata lines while programming using the direct h/w controls. > > (In this case, round-up and round-to-nearest give the same answer > anyway, though!) > > > > > The region aware h/w supports separate bandwidth controls for each > > region. We could hope (or perhaps update the spec to define) that > > region0 is always node-local DDR memory and keep the "MB" tag for > > that. > > Do you have concerns about existing software choking on the #-prefixed > lines? Do they even need a # prefix? We already mix lines for multiple resources in the schemata file with a separate prefix for each resource. The schemata file also allows writes to just update one resource (or one domain in a single resource). The schemata file started with just "L3". Then we added "L2", "MB", and "SMBA" with no concern that the initial "L3" manipulating tools would be confused. > > Then use some other tag naming for other regions. Remote DDR, > > local CXL, remote CXL are the ones we think are next in the h/w > > memory sequence. But the "region" concept would allow for other > > options as other memory technologies come into use. > > Would it be reasonable just to have a set of these schema instances, per > region, so: > > MB_HW: ... // implicitly region 0 > MB_HW_1: ... > MB_HW_2: ... Chen Yu is currently looking at putting the word "TIER" into the name, since there's some precedent for describing memory in "tiers". Whatever naming scheme is used, the important part is how will users find out what each schemata line actually means/controls. > etc. > > Or, did you have something else in mind?
> > My thinking is that we avoid adding complexity in the schemata file if > we treat mapping these schema instances onto the hardware topology as > an orthogonal problem. So long as we have unique names in the schemata > file, we can describe elsewhere what they relate to in the hardware. Yes, exactly this. > > Cheers > ---Dave -Tony ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-29 16:37 ` Luck, Tony @ 2025-09-30 16:02 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-09-30 16:02 UTC (permalink / raw) To: Luck, Tony Cc: linux-kernel, Reinette Chatre, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Tony, On Mon, Sep 29, 2025 at 09:37:41AM -0700, Luck, Tony wrote: > On Mon, Sep 29, 2025 at 02:56:19PM +0100, Dave Martin wrote: > > Hi Tony, > > > > Thanks for taking at look at this -- comments below. > > > > [...] > > > > On Thu, Sep 25, 2025 at 03:58:35PM -0700, Luck, Tony wrote: > > > On Mon, Sep 22, 2025 at 04:04:40PM +0100, Dave Martin wrote: > > > > [...] > > > > > > Would something like the following work? A read from schemata might > > > > produce something like this: > > > > > > > > MB: 0=50, 1=50 > > > > # MB_HW: 0=32, 1=32 > > > > # MB_MIN: 0=31, 1=31 > > > > # MB_MAX: 0=32, 1=32 > > > > [...] > > > > > > I'd be interested in people's thoughts on it, though. > > > > > > Applying this to Intel upcoming region aware memory bandwidth > > > that supports 255 steps and h/w min/max limits. > > > > Following the MPAM example, would you also expect: > > > > scale: 255 > > unit: 100pc > > > > ...? > > Yes. 255 (or whatever "Q" value is provided in the ACPI table) > corresponds to no throttling, so 100% bandwidth. > > > > > > We would have info files with "min = 1, max = 255" and a schemata > > > file that looks like this to legacy apps: [...] > > > MB: 0=4;1=75 [...] > > > with the legacy line updated with the rounded value of the MB_HW > > > supplied by the user. 10/255 = 3.921% ... so call it "4". > > > > I'm suggesting that this always be rounded up, so that you have a > > guarantee that the steps are no smaller than the reported value. > > Round up, rather than round-to-nearest, make sense. 
Though perhaps > only cosmetic as I would be surprised if anyone has a mix of tools > looking at the legacy schemata lines while programming using the > direct h/w controls. Ack [...] > > Do you have concerns about existing software choking on the #-prefixed > > lines? > > Do they even need a # prefix? We already mix lines for multiple > resources in the schemata file with a separate prefix for each resource. > The schemata file also allows writes to just update one resource (or > one domain in a single resource). The schemata file started with just > "L3". Then we added "L2", "MB", and "SMBA" with no concern that the > initial "L3" manipulating tools would be confused. The "#" thing is for backwards compatibility with old userspace that might blindly "paste back" unknown entries when writing the schemata file. (See also my reply to Reinette [1].) > > > Then use some other tag naming for other regions. Remote DDR, > > > local CXL, remote CXL are the ones we think are next in the h/w > > > memory sequence. But the "region" concept would allow for other > > > options as other memory technologies come into use. > > > > Would it be reasnable just to have a set of these schema instances, per > > region, so: > > > > MB_HW: ... // implicitly region 0 > > MB_HW_1: ... > > MB_HW_2: ... > > Chen Yu is currently looking at putting the word "TIER" into the > name, since there's some precedent for describing memory in "tiers". > > Whatever naming scheme is used, the important part is how will users > find out what each schemata line actually means/controls. Agreed. That's a problem, but a separate one. [...] > > Or, did you have something else in mind? > > > > My thinking is that we avoid adding complexity in the schemata file if > > we treat mapping these schema instances onto the hardware topology as > > an orthogonal problem. So long as we have unique names in the schemata > > file, we can describe elsewhere what they relate to in the hardware. > > Yes, exactly this. 
OK, that's reassuring. Cheers ---Dave [1] https://lore.kernel.org/lkml/aNv53UmFGDBL0z3O@e133380.arm.com/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-22 15:04 ` Dave Martin 2025-09-25 22:58 ` Luck, Tony @ 2025-09-26 20:54 ` Reinette Chatre 2025-09-29 13:40 ` Dave Martin 1 sibling, 1 reply; 52+ messages in thread From: Reinette Chatre @ 2025-09-26 20:54 UTC (permalink / raw) To: Dave Martin, linux-kernel Cc: Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Dave, Just one correction ... On 9/22/25 8:04 AM, Dave Martin wrote: > On Fri, Sep 12, 2025 at 03:19:04PM -0700, Reinette Chatre wrote: > > [...] > >>> Clamping to bw_min and bw_max still feels generic: leave it in the core >>> code, for now. >> >> Sounds like MPAM may be ready to start the schema parsing discussion again? >> I understand that MPAM has a few more ways to describe memory bandwidth as >> well as cache portion partitioning. Previously ([1] [2]) James mused about exposing >> schema format to user space, which seems like a good idea for new schema. > > On this topic, specifically: > > > My own ideas in this area are a little different, though I agree with > the general idea. > > Bitmap controls are distinct from numeric values, but for numbers, I'm > not sure that distinguishing percentages from other values is required, > since this is really just a specific case of a linear scale. 
> > I imagined a generic numeric schema, described by a set of files like > the following in a schema's info directory: > > min: minimum value, e.g., 1 > max: maximum value, e.g., 1023 > scale: value that corresponds to one unit > unit: quantified base unit, e.g., "100pc", "64MBps" > map: mapping function name > > If s is the value written in a schemata entry and p is the > corresponding physical amount of resource, then > > min <= s <= max > > and > > p = map(s / scale) * unit > > One reason why I prefer this scaling scheme over the floating-point > approach is that it can be exact (at least for currently known > platforms), and it doesn't require a new floating-point parser/ > formatter to be written for this one thing in the kernel (which I > suspect is likely to be error-prone and poorly defined around > subtleties such as rounding behaviour). > > "map" anticipates non-linear ramps, but this is only really here as a > forwards compatibility get-out. For now, this might just be set to > "none", meaning the identity mapping (i.e., a no-op). This may shadow > the existing the "delay_linear" parameter, but with more general > applicabillity if we need it. > > > The idea is that userspace reads the info files and then does the > appropriate conversions itself. This might or might not be seen as a > burden, but would give exact control over the hardware configuration > with a generic interface, with possibly greater precision than the > existing schemata allow (when the hardware supports it), and without > having to second-guess the rounding that the kernel may or may not do > on the values. > > For RDT MBA, we might have > > min: 10 > max: 100 > scale: 100 > unit: 100pc > map: none > > The schemata entry > > MB: 0=10, 1=100 > > would allocate the minimum possible bandwidth to domain 0, and 100% > bandwidth to domain 1. > > > For AMD SMBA, we might have: > > min: 1 > max: 100 > scale: 8 > unit: 1GBps > Unfortunately not like this for AMD. 
Initial support for AMD MBA set max to a hardcoded 2048 [1] that was later [2] modified to learn max from hardware. Of course this broke resctrl as a generic interface and I hope we learned enough since to not repeat this mistake nor give up on MB and make its interface even worse by, for example, adding more architecture specific input ranges. Reinette [1] commit 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature") [2] commit 0976783bb123 ("x86/resctrl: Remove hard-coded memory bandwidth limit") ^ permalink raw reply [flat|nested] 52+ messages in thread
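For illustration, the numeric-schema proposal quoted above (per-schema min/max/scale/unit info files, with map fixed to "none") could be exercised from userspace roughly as follows. This is a sketch of a proposal still under discussion, not an interface the kernel provides; the class shape and the example parameter values are assumptions drawn from the thread.

```python
from dataclasses import dataclass

@dataclass
class NumericSchema:
    """Parameters a tool would read from a schema's info directory
    under the proposal in this thread (map == "none", the identity)."""
    min: int     # minimum schemata value
    max: int     # maximum schemata value
    scale: int   # schemata value corresponding to one unit
    unit: float  # physical amount per unit (percent, GBps, ...)

    def to_physical(self, s: int) -> float:
        """p = map(s / scale) * unit, with the identity mapping."""
        if not self.min <= s <= self.max:
            raise ValueError("schemata value out of range")
        return s / self.scale * self.unit

# RDT MBA example from the thread: min 10, max 100, scale 100, unit 100pc.
rdt_mba = NumericSchema(min=10, max=100, scale=100, unit=100.0)
```

Here rdt_mba.to_physical(10) gives the minimum 10% allocation and to_physical(100) the full 100%, matching the MB: 0=10, 1=100 example; the conversion stays exact integer-in, scaled-out, with no floating-point parsing needed in the kernel.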
* Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch 2025-09-26 20:54 ` Reinette Chatre @ 2025-09-29 13:40 ` Dave Martin 0 siblings, 0 replies; 52+ messages in thread From: Dave Martin @ 2025-09-29 13:40 UTC (permalink / raw) To: Reinette Chatre Cc: linux-kernel, Tony Luck, James Morse, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet, x86, linux-doc Hi Reinette, On Fri, Sep 26, 2025 at 01:54:10PM -0700, Reinette Chatre wrote: > Hi Dave, > > Just one correction ... > > On 9/22/25 8:04 AM, Dave Martin wrote: [...] > > For AMD SMBA, we might have: > > > > min: 1 > > max: 100 > > scale: 8 > > unit: 1GBps > > > > Unfortunately not like this for AMD. Initial support for AMD MBA set max > to a hardcoded 2048 [1] that was later [2] modified to learn max from hardware. > Of course this broke resctrl as a generic interface and I hope we learned > enough since to not repeat this mistake nor give up on MB and make its interface > even worse by, for example, adding more architecture specific input ranges. > > Reinette > > [1] commit 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature") > [2] commit 0976783bb123 ("x86/resctrl: Remove hard-coded memory bandwidth limit") The "100" was just picked randomly in my example. Looking more carefully at the spec, it may make sense to have: min: 1 max: (1 << value of BW_LEN) scale: 8 unit: 1GBps (This max value corresponds to setting the "unlimited" bit in the control MSR; the other bits of the bandwidth value are then ignored. For this instance of the schema, programming the "max" value would be expected to give the nearest approximation to unlimited bandwidth that the hardware permits.) While the memory system is under-utilised end-to-end, I would expect throughput from a memory-bound job to scale linearly with the control value, but the control level at which throughput starts to saturate will depend on the pattern of load throughout the system.
This seems fundamentally different from percentage controls -- it looks
impossible to simulate proportional controls with absolute throughput
controls, or vice versa (?)

Cheers
---Dave
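The last point can be illustrated with toy numbers (purely
illustrative, not taken from any spec): a proportional control tracks
whatever bandwidth is currently achievable, while an absolute control
does not, so no single fixed setting of one reproduces the other
across different load conditions:

```c
/*
 * Toy model, illustrative numbers only: the effective bandwidth
 * (GBps) a memory-bound job sees under each control type, given the
 * bandwidth currently achievable end-to-end through the system.
 */
static unsigned int proportional_cap(unsigned int achievable, unsigned int pct)
{
	/* percentage control: a fixed fraction of whatever is achievable */
	return achievable * pct / 100;
}

static unsigned int absolute_cap(unsigned int achievable, unsigned int gbps)
{
	/* throughput control: a fixed ceiling, independent of load */
	return achievable < gbps ? achievable : gbps;
}
```

With 50% versus a 50GBps cap, both yield 50GBps when 100GBps is
achievable; but if contention drops the achievable bandwidth to
40GBps, the proportional control yields 20GBps while the absolute one
yields 40GBps, so neither emulates the other.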
end of thread, other threads:[~2025-10-23 14:04 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --

2025-09-02 16:24 [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch Dave Martin
2025-09-12 22:19 ` Reinette Chatre
2025-09-22 14:39 ` Dave Martin
2025-09-23 17:27 ` Reinette Chatre
2025-09-25 12:46 ` Dave Martin
2025-09-25 20:53 ` Reinette Chatre
2025-09-25 21:35 ` Luck, Tony
2025-09-25 22:18 ` Reinette Chatre
2025-09-29 13:08 ` Dave Martin
2025-09-29 12:43 ` Dave Martin
2025-09-29 15:38 ` Reinette Chatre
2025-09-29 16:10 ` Dave Martin
2025-10-15 15:18 ` Dave Martin
2025-10-16 15:57 ` Reinette Chatre
2025-10-17 15:52 ` Dave Martin
2025-09-22 15:04 ` Dave Martin
2025-09-25 22:58 ` Luck, Tony
2025-09-29  9:19 ` Chen, Yu C
2025-09-29 14:13 ` Dave Martin
2025-09-29 16:23 ` Luck, Tony
2025-09-30 11:02 ` Chen, Yu C
2025-09-30 16:08 ` Luck, Tony
2025-09-30  4:43 ` Chen, Yu C
2025-09-30 15:55 ` Dave Martin
2025-10-01 12:13 ` Chen, Yu C
2025-10-02 15:40 ` Dave Martin
2025-10-02 16:43 ` Luck, Tony
2025-09-29 13:56 ` Dave Martin
2025-09-29 16:09 ` Reinette Chatre
2025-09-30 15:40 ` Dave Martin
2025-10-10 16:48 ` Reinette Chatre
2025-10-11 17:15 ` Chen, Yu C
2025-10-13 15:01 ` Dave Martin
2025-10-13 14:36 ` Dave Martin
2025-10-14 22:55 ` Reinette Chatre
2025-10-15 15:47 ` Dave Martin
2025-10-15 18:48 ` Luck, Tony
2025-10-16 14:50 ` Dave Martin
2025-10-16 16:31 ` Reinette Chatre
2025-10-17 14:17 ` Dave Martin
2025-10-17 15:59 ` Reinette Chatre
2025-10-20 15:50 ` Dave Martin
2025-10-20 16:31 ` Luck, Tony
2025-10-21 14:37 ` Dave Martin
2025-10-21 20:59 ` Luck, Tony
2025-10-22 14:58 ` Dave Martin
2025-10-22 16:21 ` Luck, Tony
2025-10-23 14:04 ` Dave Martin
2025-09-29 16:37 ` Luck, Tony
2025-09-30 16:02 ` Dave Martin
2025-09-26 20:54 ` Reinette Chatre
2025-09-29 13:40 ` Dave Martin