linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
@ 2025-10-31 15:41 Dave Martin
  2025-11-10 11:06 ` Ben Horgan
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dave Martin @ 2025-10-31 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tony Luck, Reinette Chatre, James Morse, Ben Horgan,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jonathan Corbet, x86, linux-doc

The control value parser for the MB resource currently coerces the
memory bandwidth percentage value from userspace to be an exact
multiple of the rdt_resource::resctrl_membw::bw_gran parameter.

On MPAM systems, this results in somewhat worse-than-worst-case
rounding, since the bandwidth granularity advertised to resctrl by the
MPAM driver is in general only an approximation to the actual hardware
granularity on these systems, and the hardware bandwidth allocation
control value is not natively a percentage -- necessitating a further
conversion in the resctrl_arch_update_domains() path, regardless of the
conversion done at parse time.

Allow the arch to provide its own parse-time conversion that is
appropriate for the hardware, and move the existing conversion to x86.
This will avoid accumulated error from rounding the value twice on MPAM
systems.

Clarify the documentation, but avoid overly exact promises.

Clamping to bw_min and bw_max still feels generic: leave it in the core
code, for now.

No functional change.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>

---

Based on: v6.18-rc3

This is a minor update to v1 [1], with no functional differences.

Testing: see "Testing Notes", below.

Apologies for the long tearoff that follows...
For the actual patch, skip to the regex '^diff'.


Changes since v1:

 * Commit message updates: Fixed subsytem tag and casing of "MBA" in
   the subject line; named the affected parameter explicitly,
   rdt_resource::resctrl_membw::bw_gran.

 * Miscellenaeous function comment blocks and documentation text
   updated in response to review comments.

 * Updated documentation of bw_gran as per Reinette's suggestions.

 * Renamed resctrl_arch_round_bw() -> resctrl_arch_preconvert_bw(),
   since this is just a hook for the arch code to do something to a
   value that it will use later; the core code doesn't care what
   transformation is done here.  Clarified the accompanying comments.

 * Removed dangling forward declaration of struct rdt_resource in
   arch/x86/include/asm/resctrl.h.


Original tear-off notes from v1:

I put the x86 version out of line in order to avoid having to move
struct rdt_resource and its dependencies into resctrl_types.h -- which
would create a lot of diff noise.  Schemata writes from userspace have
a high overhead in any case.

For MPAM the conversion will be a no-op, because the incoming
percentage from the core resctrl code needs to be converted to hardware
representation in the driver anyway.

Perhaps _all_ the types should move to resctrl_types.h.

For now, I went for the smallest diffstat...

---
 Documentation/filesystems/resctrl.rst     | 17 +++++++++--------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  6 ++++++
 fs/resctrl/ctrlmondata.c                  |  6 +++---
 include/linux/resctrl.h                   | 14 ++++++++++++++
 4 files changed, 32 insertions(+), 11 deletions(-)

---

Testing Notes:

This patch should not make any functional change, but for completeness
I have included comparisons of the output of the restrcl kselftests.

See [3], [4] for detailed diffs.

(dmesg timestamp and PID changes squashed.)

The vanilla 6.18-rc1 and patches 6.18-rc1 test runs are those discussed
in [2] (see there for discussion of the out-of-tolerance results

---

References:

[1] [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
https://lore.kernel.org/lkml/20250902162507.18520-1-Dave.Martin@arm.com/

[2] Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
(test discussion subthread)
https://lore.kernel.org/lkml/aO+7MeSMV29VdbQs@e133380.arm.com/

Test diffs:

[3] resctrl_tests output diff
from vanilla 6.18-rc1 (busy, recently-booted system)
to this patch on 6.18-rc3 (quiet system).

# --- base-rc1/resctrl_tests_6.18.0-rc1.out	2025-10-14 17:11:56.000000000 +0100
# +++ test-rc3/resctrl_tests.out	2025-10-28 16:07:12.000000000 +0000
# @@ -1,132 +1,132 @@
#  TAP version 13
#  # Pass: Check kernel supports resctrl filesystem
#  # Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists
#  # resctrl filesystem not mounted
#  # dmesg: [    1.397893] resctrl: L3 allocation detected
#  # dmesg: [    1.397928] resctrl: MB allocation detected
#  # dmesg: [    1.397961] resctrl: L3 monitoring detected
#  1..6
#  # Starting MBM test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Writing benchmark parameters to resctrl FS
#  # Benchmark PID: 998
#  # Write schema "MB:0=100" to resctrl FS
#  # Checking for pass/fail
#  # Pass: Check MBM diff within 8%
# -# avg_diff_per: 0%
# +# avg_diff_per: 1%
#  # Span (MB): 250
# -# avg_bw_imc: 6422
# -# avg_bw_resc: 6392
# +# avg_bw_imc: 6509
# +# avg_bw_resc: 6580
#  ok 1 MBM: test
#  # Starting MBA test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Writing benchmark parameters to resctrl FS
#  # Benchmark PID: 999
#  # Write schema "MB:0=10" to resctrl FS
#  # Write schema "MB:0=20" to resctrl FS
#  # Write schema "MB:0=30" to resctrl FS
#  # Write schema "MB:0=40" to resctrl FS
#  # Write schema "MB:0=50" to resctrl FS
#  # Write schema "MB:0=60" to resctrl FS
#  # Write schema "MB:0=70" to resctrl FS
#  # Write schema "MB:0=80" to resctrl FS
#  # Write schema "MB:0=90" to resctrl FS
#  # Write schema "MB:0=100" to resctrl FS
#  # Results are displayed in (MB)
#  # Pass: Check MBA diff within 8% for schemata 10
# -# avg_diff_per: 1%
# -# avg_bw_imc: 2033
# -# avg_bw_resc: 2012
# +# avg_diff_per: 0%
# +# avg_bw_imc: 1987
# +# avg_bw_resc: 2004
#  # Pass: Check MBA diff within 8% for schemata 20
#  # avg_diff_per: 0%
# -# avg_bw_imc: 3028
# +# avg_bw_imc: 2980
#  # avg_bw_resc: 3005
#  # Pass: Check MBA diff within 8% for schemata 30
#  # avg_diff_per: 0%
# -# avg_bw_imc: 3982
# -# avg_bw_resc: 3958
# +# avg_bw_imc: 3981
# +# avg_bw_resc: 4012
#  # Pass: Check MBA diff within 8% for schemata 40
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6265
# -# avg_bw_resc: 6236
# +# avg_bw_imc: 6302
# +# avg_bw_resc: 6351
#  # Pass: Check MBA diff within 8% for schemata 50
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6384
# -# avg_bw_resc: 6355
# +# avg_bw_imc: 6417
# +# avg_bw_resc: 6468
#  # Pass: Check MBA diff within 8% for schemata 60
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6405
# -# avg_bw_resc: 6376
# +# avg_bw_imc: 6475
# +# avg_bw_resc: 6527
#  # Pass: Check MBA diff within 8% for schemata 70
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6417
# -# avg_bw_resc: 6387
# +# avg_bw_imc: 6504
# +# avg_bw_resc: 6557
#  # Pass: Check MBA diff within 8% for schemata 80
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6418
# -# avg_bw_resc: 6394
# +# avg_bw_imc: 6502
# +# avg_bw_resc: 6556
#  # Pass: Check MBA diff within 8% for schemata 90
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6412
# -# avg_bw_resc: 6384
# +# avg_bw_imc: 6506
# +# avg_bw_resc: 6555
#  # Pass: Check MBA diff within 8% for schemata 100
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6425
# -# avg_bw_resc: 6399
# +# avg_bw_imc: 6518
# +# avg_bw_resc: 6572
#  # Pass: Check schemata change using MBA
#  ok 2 MBA: test
#  # Starting CMT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Cache size :23068672
#  # Writing benchmark parameters to resctrl FS
# -# Benchmark PID: 5135
# +# Benchmark PID: 1000
#  # Checking for pass/fail
# -# Fail: Check cache miss rate within 15%
# -# Percent diff=24
# +# Pass: Check cache miss rate within 15%
# +# Percent diff=9
#  # Number of bits: 5
# -# Average LLC val: 7942963
# +# Average LLC val: 9528934
#  # Cache span (bytes): 10485760
# -not ok 3 CMT: test
# +ok 3 CMT: test
#  # Starting L3_CAT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Cache size :23068672
#  # Writing benchmark parameters to resctrl FS
#  # Write schema "L3:0=1f0" to resctrl FS
#  # Write schema "L3:0=f" to resctrl FS
#  # Write schema "L3:0=1f8" to resctrl FS
#  # Write schema "L3:0=7" to resctrl FS
#  # Write schema "L3:0=1fc" to resctrl FS
#  # Write schema "L3:0=3" to resctrl FS
#  # Write schema "L3:0=1fe" to resctrl FS
#  # Write schema "L3:0=1" to resctrl FS
#  # Checking for pass/fail
#  # Number of bits: 4
# -# Average LLC val: 71434
# +# Average LLC val: 76870
#  # Cache span (lines): 131072
#  # Pass: Check cache miss rate changed more than 2.0%
# -# Percent diff=70.0
# +# Percent diff=63.7
#  # Number of bits: 3
# -# Average LLC val: 121463
# +# Average LLC val: 125814
#  # Cache span (lines): 98304
#  # Pass: Check cache miss rate changed more than 1.0%
# -# Percent diff=40.8
# +# Percent diff=41.0
#  # Number of bits: 2
# -# Average LLC val: 170978
# +# Average LLC val: 177346
#  # Cache span (lines): 65536
#  # Pass: Check cache miss rate changed more than 0.0%
# -# Percent diff=22.8
# +# Percent diff=20.8
#  # Number of bits: 1
# -# Average LLC val: 209950
# +# Average LLC val: 214175
#  # Cache span (lines): 32768
#  ok 4 L3_CAT: test
#  # Starting L3_NONCONT_CAT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Write schema "L3:0=3f" to resctrl FS
#  # Write schema "L3:0=787" to resctrl FS # write() failed : Invalid argument
#  # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
#  ok 5 L3_NONCONT_CAT: test
#  # Starting L2_NONCONT_CAT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  ok 6 # SKIP Hardware does not support L2_NONCONT_CAT or L2_NONCONT_CAT is disabled
#  # 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# -# Totals: pass:4 fail:1 xfail:0 xpass:0 skip:1 error:0
# +# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:1 error:0


[4] resctrl_tests output diff
from this patch on 6.18-rc1 (busy, recently-booted system)
to this patch on 6.18-rc3 (quiet system).

# --- test-rc1/resctrl_tests_6.18.0-rc1-test1.out	2025-10-14 17:21:44.000000000 +0100
# +++ test-rc3/resctrl_tests.out	2025-10-28 16:07:12.000000000 +0000
# @@ -1,132 +1,132 @@
#  TAP version 13
#  # Pass: Check kernel supports resctrl filesystem
#  # Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists
#  # resctrl filesystem not mounted
#  # dmesg: [    1.397893] resctrl: L3 allocation detected
#  # dmesg: [    1.397928] resctrl: MB allocation detected
#  # dmesg: [    1.397961] resctrl: L3 monitoring detected
#  1..6
#  # Starting MBM test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Writing benchmark parameters to resctrl FS
#  # Benchmark PID: 998
#  # Write schema "MB:0=100" to resctrl FS
#  # Checking for pass/fail
#  # Pass: Check MBM diff within 8%
# -# avg_diff_per: 0%
# +# avg_diff_per: 1%
#  # Span (MB): 250
# -# avg_bw_imc: 6886
# -# avg_bw_resc: 6943
# +# avg_bw_imc: 6509
# +# avg_bw_resc: 6580
#  ok 1 MBM: test
#  # Starting MBA test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Writing benchmark parameters to resctrl FS
#  # Benchmark PID: 999
#  # Write schema "MB:0=10" to resctrl FS
#  # Write schema "MB:0=20" to resctrl FS
#  # Write schema "MB:0=30" to resctrl FS
#  # Write schema "MB:0=40" to resctrl FS
#  # Write schema "MB:0=50" to resctrl FS
#  # Write schema "MB:0=60" to resctrl FS
#  # Write schema "MB:0=70" to resctrl FS
#  # Write schema "MB:0=80" to resctrl FS
#  # Write schema "MB:0=90" to resctrl FS
#  # Write schema "MB:0=100" to resctrl FS
#  # Results are displayed in (MB)
#  # Pass: Check MBA diff within 8% for schemata 10
#  # avg_diff_per: 0%
# -# avg_bw_imc: 2028
# -# avg_bw_resc: 2032
# +# avg_bw_imc: 1987
# +# avg_bw_resc: 2004
#  # Pass: Check MBA diff within 8% for schemata 20
#  # avg_diff_per: 0%
# -# avg_bw_imc: 3006
# -# avg_bw_resc: 3011
# +# avg_bw_imc: 2980
# +# avg_bw_resc: 3005
#  # Pass: Check MBA diff within 8% for schemata 30
#  # avg_diff_per: 0%
# -# avg_bw_imc: 4006
# -# avg_bw_resc: 4013
# +# avg_bw_imc: 3981
# +# avg_bw_resc: 4012
#  # Pass: Check MBA diff within 8% for schemata 40
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6726
# -# avg_bw_resc: 6732
# +# avg_bw_imc: 6302
# +# avg_bw_resc: 6351
#  # Pass: Check MBA diff within 8% for schemata 50
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6854
# -# avg_bw_resc: 6856
# +# avg_bw_imc: 6417
# +# avg_bw_resc: 6468
#  # Pass: Check MBA diff within 8% for schemata 60
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6882
# -# avg_bw_resc: 6883
# +# avg_bw_imc: 6475
# +# avg_bw_resc: 6527
#  # Pass: Check MBA diff within 8% for schemata 70
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6891
# -# avg_bw_resc: 6889
# +# avg_bw_imc: 6504
# +# avg_bw_resc: 6557
#  # Pass: Check MBA diff within 8% for schemata 80
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6893
# -# avg_bw_resc: 6909
# +# avg_bw_imc: 6502
# +# avg_bw_resc: 6556
#  # Pass: Check MBA diff within 8% for schemata 90
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6890
# -# avg_bw_resc: 6888
# +# avg_bw_imc: 6506
# +# avg_bw_resc: 6555
#  # Pass: Check MBA diff within 8% for schemata 100
#  # avg_diff_per: 0%
# -# avg_bw_imc: 6929
# -# avg_bw_resc: 6951
# +# avg_bw_imc: 6518
# +# avg_bw_resc: 6572
#  # Pass: Check schemata change using MBA
#  ok 2 MBA: test
#  # Starting CMT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Cache size :23068672
#  # Writing benchmark parameters to resctrl FS
#  # Benchmark PID: 1000
#  # Checking for pass/fail
#  # Pass: Check cache miss rate within 15%
# -# Percent diff=4
# +# Percent diff=9
#  # Number of bits: 5
# -# Average LLC val: 10918297
# +# Average LLC val: 9528934
#  # Cache span (bytes): 10485760
#  ok 3 CMT: test
#  # Starting L3_CAT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Cache size :23068672
#  # Writing benchmark parameters to resctrl FS
#  # Write schema "L3:0=1f0" to resctrl FS
#  # Write schema "L3:0=f" to resctrl FS
#  # Write schema "L3:0=1f8" to resctrl FS
#  # Write schema "L3:0=7" to resctrl FS
#  # Write schema "L3:0=1fc" to resctrl FS
#  # Write schema "L3:0=3" to resctrl FS
#  # Write schema "L3:0=1fe" to resctrl FS
#  # Write schema "L3:0=1" to resctrl FS
#  # Checking for pass/fail
#  # Number of bits: 4
# -# Average LLC val: 70161
# +# Average LLC val: 76870
#  # Cache span (lines): 131072
#  # Pass: Check cache miss rate changed more than 2.0%
# -# Percent diff=72.1
# +# Percent diff=63.7
#  # Number of bits: 3
# -# Average LLC val: 120755
# +# Average LLC val: 125814
#  # Cache span (lines): 98304
#  # Pass: Check cache miss rate changed more than 1.0%
# -# Percent diff=42.5
# +# Percent diff=41.0
#  # Number of bits: 2
# -# Average LLC val: 172077
# +# Average LLC val: 177346
#  # Cache span (lines): 65536
#  # Pass: Check cache miss rate changed more than 0.0%
# -# Percent diff=22.0
# +# Percent diff=20.8
#  # Number of bits: 1
# -# Average LLC val: 209893
# +# Average LLC val: 214175
#  # Cache span (lines): 32768
#  ok 4 L3_CAT: test
#  # Starting L3_NONCONT_CAT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  # Write schema "L3:0=3f" to resctrl FS
#  # Write schema "L3:0=787" to resctrl FS # write() failed : Invalid argument
#  # Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected
#  ok 5 L3_NONCONT_CAT: test
#  # Starting L2_NONCONT_CAT test ...
#  # Mounting resctrl to "/sys/fs/resctrl"
#  ok 6 # SKIP Hardware does not support L2_NONCONT_CAT or L2_NONCONT_CAT is disabled
#  # 1 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
#  # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:1 error:0


diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index b7f35b07876a..fbbcf5421346 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -144,12 +144,11 @@ with respect to allocation:
 		user can request.
 
 "bandwidth_gran":
-		The granularity in which the memory bandwidth
-		percentage is allocated. The allocated
-		b/w percentage is rounded off to the next
-		control step available on the hardware. The
-		available bandwidth control steps are:
-		min_bandwidth + N * bandwidth_gran.
+		The approximate granularity in which the memory bandwidth
+		percentage is allocated. The allocated bandwidth percentage
+		is rounded up to the next control step available on the
+		hardware. The available hardware steps are no larger than
+		this value.
 
 "delay_linear":
 		Indicates if the delay scale is linear or
@@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
 and can be looked up through "info/MB/min_bandwidth". The bandwidth
 granularity that is allocated is also dependent on the cpu model and can
 be looked up at "info/MB/bandwidth_gran". The available bandwidth
-control steps are: min_bw + N * bw_gran. Intermediate values are rounded
-to the next control step available on the hardware.
+control steps are, approximately, min_bw + N * bw_gran.  The steps may
+appear irregular due to rounding to an exact percentage: bw_gran is the
+maximum interval between the percentage values corresponding to any two
+adjacent steps in the hardware.
 
 The bandwidth throttling is a core specific mechanism on some of Intel
 SKUs. Using a high bandwidth and a low bandwidth setting on two threads
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 1189c0df4ad7..dc05a50e80fb 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -16,9 +16,15 @@
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
 #include <linux/cpu.h>
+#include <linux/math.h>
 
 #include "internal.h"
 
+u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r)
+{
+	return roundup(val, (unsigned long)r->membw.bw_gran);
+}
+
 int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 0d0ef54fc4de..26e3f7c88c86 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -35,8 +35,8 @@ typedef int (ctrlval_parser_t)(struct rdt_parse_data *data,
 /*
  * Check whether MBA bandwidth percentage value is correct. The value is
  * checked against the minimum and max bandwidth values specified by the
- * hardware. The allocated bandwidth percentage is rounded to the next
- * control step available on the hardware.
+ * hardware. The allocated bandwidth percentage is converted as
+ * appropriate for consumption by the specific hardware driver.
  */
 static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
 {
@@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
 		return false;
 	}
 
-	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
+	*data = resctrl_arch_preconvert_bw(bw, r);
 	return true;
 }
 
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a7d92718b653..1fb6e2118b61 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -485,6 +485,20 @@ bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
  */
 int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
 
+/*
+ * Convert a bandwidth control value to the appropriate form for
+ * consumption by the hardware driver for resource r.
+ *
+ * For example, it simplifies the x86 RDT implementation to round the
+ * value to a suitable step here and then treat the resulting value as
+ * opaque when programming the hardware MSRs later on.
+ *
+ * Architectures for which this pre-conversion hook is not useful
+ * should supply an implementation of this function that just returns
+ * val unmodified.
+ */
+u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r);
+
 /*
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.

base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-10-31 15:41 [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch Dave Martin
@ 2025-11-10 11:06 ` Ben Horgan
  2025-11-20 15:30   ` Dave Martin
  2025-11-14 22:17 ` Reinette Chatre
  2025-11-20 16:46 ` Reinette Chatre
  2 siblings, 1 reply; 11+ messages in thread
From: Ben Horgan @ 2025-11-10 11:06 UTC (permalink / raw)
  To: Dave Martin, linux-kernel
  Cc: Tony Luck, Reinette Chatre, James Morse, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jonathan Corbet, x86, linux-doc

Hi Dave,

On 10/31/25 15:41, Dave Martin wrote:
> The control value parser for the MB resource currently coerces the
> memory bandwidth percentage value from userspace to be an exact
> multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
> 
> On MPAM systems, this results in somewhat worse-than-worst-case
> rounding, since the bandwidth granularity advertised to resctrl by the
> MPAM driver is in general only an approximation to the actual hardware
> granularity on these systems, and the hardware bandwidth allocation
> control value is not natively a percentage -- necessitating a further
> conversion in the resctrl_arch_update_domains() path, regardless of the
> conversion done at parse time.
> 
> Allow the arch to provide its own parse-time conversion that is
> appropriate for the hardware, and move the existing conversion to x86.
> This will avoid accumulated error from rounding the value twice on MPAM
> systems.
> 
> Clarify the documentation, but avoid overly exact promises.
> 
> Clamping to bw_min and bw_max still feels generic: leave it in the core
> code, for now.
> 
> No functional change.
> 
> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> 
Seems sensible and helpful for MPAM.

Reviewed-by: Ben Horgan <ben.horgan@arm.com>

Thanks,

Ben


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-10-31 15:41 [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch Dave Martin
  2025-11-10 11:06 ` Ben Horgan
@ 2025-11-14 22:17 ` Reinette Chatre
  2025-11-18 15:47   ` Dave Martin
  2025-11-20 16:46 ` Reinette Chatre
  2 siblings, 1 reply; 11+ messages in thread
From: Reinette Chatre @ 2025-11-14 22:17 UTC (permalink / raw)
  To: Dave Martin, linux-kernel
  Cc: Tony Luck, James Morse, Ben Horgan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
	x86, linux-doc

Hi Dave,

On 10/31/25 8:41 AM, Dave Martin wrote:
> The control value parser for the MB resource currently coerces the
> memory bandwidth percentage value from userspace to be an exact
> multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
> 
> On MPAM systems, this results in somewhat worse-than-worst-case
> rounding, since the bandwidth granularity advertised to resctrl by the
> MPAM driver is in general only an approximation to the actual hardware
> granularity on these systems, and the hardware bandwidth allocation
> control value is not natively a percentage -- necessitating a further
> conversion in the resctrl_arch_update_domains() path, regardless of the
> conversion done at parse time.
> 
> Allow the arch to provide its own parse-time conversion that is
> appropriate for the hardware, and move the existing conversion to x86.
> This will avoid accumulated error from rounding the value twice on MPAM
> systems.
> 
> Clarify the documentation, but avoid overly exact promises.
> 
> Clamping to bw_min and bw_max still feels generic: leave it in the core
> code, for now.

I think they are only theoretically generic since arch sets them and resctrl
uses to enforce user input. Arch can thus theoretically set them to whatever
the u32 used to represent it allows. Of course, doing something like this makes
the interface even harder for users to use.

> 
> No functional change.
> 
> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> 
> ---

...

> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index b7f35b07876a..fbbcf5421346 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -144,12 +144,11 @@ with respect to allocation:
>  		user can request.
>  
>  "bandwidth_gran":
> -		The granularity in which the memory bandwidth
> -		percentage is allocated. The allocated
> -		b/w percentage is rounded off to the next
> -		control step available on the hardware. The
> -		available bandwidth control steps are:
> -		min_bandwidth + N * bandwidth_gran.
> +		The approximate granularity in which the memory bandwidth
> +		percentage is allocated. The allocated bandwidth percentage
> +		is rounded up to the next control step available on the
> +		hardware. The available hardware steps are no larger than
> +		this value.
>  
>  "delay_linear":
>  		Indicates if the delay scale is linear or
> @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
>  and can be looked up through "info/MB/min_bandwidth". The bandwidth
>  granularity that is allocated is also dependent on the cpu model and can
>  be looked up at "info/MB/bandwidth_gran". The available bandwidth
> -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> -to the next control step available on the hardware.
> +control steps are, approximately, min_bw + N * bw_gran.  The steps may
> +appear irregular due to rounding to an exact percentage: bw_gran is the
> +maximum interval between the percentage values corresponding to any two
> +adjacent steps in the hardware.

What can bw_gran be expected to be on Arm systems? Could existing usage be supported with
MPAM setting bw_gran to 1?

What will these control steps actually look like when the user views the schemata file
on an Arm system?

With resctrl "coercing" the user provided value before providing it to the architecture
it controls these control steps to match what the documentation states above. If resctrl
instead provides the value directly to the architecture I see nothing preventing the
architecture from ignoring resctrl's "contract" with user space documented above and
using arbitrary control steps since it also controls resctrl_arch_get_config() that is
displayed directly to user space. What guarantee is there that resctrl_arch_get_config()
will display a value that is "approximately" min_bw + N * bw_gran? This seems like opening
the door even wider for resctrl to become architecture specific ... with this change the
schemata file becomes a direct channel between user space and the arch that risks users
needing to tread carefully when switching between different architectures.

Reinette

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-11-14 22:17 ` Reinette Chatre
@ 2025-11-18 15:47   ` Dave Martin
  2025-11-20  3:05     ` Reinette Chatre
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Martin @ 2025-11-18 15:47 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: linux-kernel, Tony Luck, James Morse, Ben Horgan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jonathan Corbet, x86, linux-doc

Hi Reinette,

On Fri, Nov 14, 2025 at 02:17:53PM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 10/31/25 8:41 AM, Dave Martin wrote:
> > The control value parser for the MB resource currently coerces the
> > memory bandwidth percentage value from userspace to be an exact
> > multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
> > 
> > On MPAM systems, this results in somewhat worse-than-worst-case
> > rounding, since the bandwidth granularity advertised to resctrl by the
> > MPAM driver is in general only an approximation to the actual hardware
> > granularity on these systems, and the hardware bandwidth allocation
> > control value is not natively a percentage -- necessitating a further
> > conversion in the resctrl_arch_update_domains() path, regardless of the
> > conversion done at parse time.
> > 
> > Allow the arch to provide its own parse-time conversion that is
> > appropriate for the hardware, and move the existing conversion to x86.
> > This will avoid accumulated error from rounding the value twice on MPAM
> > systems.
> > 
> > Clarify the documentation, but avoid overly exact promises.
> > 
> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
> 
> I think they are only theoretically generic since arch sets them and resctrl
> uses to enforce user input. Arch can thus theoretically set them to whatever
> the u32 used to represent it allows. Of course, doing something like this makes
> the interface even harder for users to use.

Hence, "feels".

This could perhaps be refined, but I didn't see an obvious reason to
change the way this works, right now.

Or, is there a problem here that I'm missing?

> 
> > 
> > No functional change.
> > 
> > Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> > 
> > ---
> 
> ...
> 
> > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> > index b7f35b07876a..fbbcf5421346 100644
> > --- a/Documentation/filesystems/resctrl.rst
> > +++ b/Documentation/filesystems/resctrl.rst
> > @@ -144,12 +144,11 @@ with respect to allocation:
> >  		user can request.
> >  
> >  "bandwidth_gran":
> > -		The granularity in which the memory bandwidth
> > -		percentage is allocated. The allocated
> > -		b/w percentage is rounded off to the next
> > -		control step available on the hardware. The
> > -		available bandwidth control steps are:
> > -		min_bandwidth + N * bandwidth_gran.
> > +		The approximate granularity in which the memory bandwidth
> > +		percentage is allocated. The allocated bandwidth percentage
> > +		is rounded up to the next control step available on the
> > +		hardware. The available hardware steps are no larger than
> > +		this value.
> >  
> >  "delay_linear":
> >  		Indicates if the delay scale is linear or
> > @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
> >  and can be looked up through "info/MB/min_bandwidth". The bandwidth
> >  granularity that is allocated is also dependent on the cpu model and can
> >  be looked up at "info/MB/bandwidth_gran". The available bandwidth
> > -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> > -to the next control step available on the hardware.
> > +control steps are, approximately, min_bw + N * bw_gran.  The steps may
> > +appear irregular due to rounding to an exact percentage: bw_gran is the
> > +maximum interval between the percentage values corresponding to any two
> > +adjacent steps in the hardware.
> 
> What can bw_gran be expected to be on Arm systems? Could existing usage be supported with
> MPAM setting bw_gran to 1?

Architecturally, ceil(100.0 / (1 << b)), where 1 <= b <= 16.

So, the possible values are

	1, 2, 4, 7, 13, 25, 50

(with 25 and 50 being the only ones that are exact).

In practice, something like 6 <= b <= 8 is probably more typical; this
would give advertised bandwidth_gran values of 2 or 1.


Re your question about existing usage, were you suggesting
unconditionally setting bw_gran to 1?

Since the values are converted to/from hardware representation in
resctrl_arch_update_domains() / resctrl_arch_get_config(), we
don't have a problem, provided that the value doesn't get rounded in
advance by bw_validate().

But if bw_gran is always 1, it will mislead userspace about the
precision.  This feels stranger for userspace than fine differences in
precisely which percentage values get read out of schemata.

> What will these control steps actually look like when the user views the schemata file
> on an Arm system?

It depends on the number of bits in the hardware value (the parameter b
above).  Picking the smallest, non-trivial value 3:

	schemata	hardware (MBW_MAX)

	 13		0
	 25		1
	 38		2
	 50		3
	 63		4
	 75		5
	 88		6
	100		7

As currently implemented, I believe that writing the values from the
"schemata" column above will result in the corresponding values from
the "hardware" column being written to the hardware.  Achitecturally,
there is no guaranteed representation for 0%; we would advertise min=13,
max=100 in info/.)

A round-to-nearest policy is followed for intermediate values on write.

	(hardware value = round(schemata value * 8.0 / 100.0) - 1).)

Reading the value back translates the value from the "hardware" column
back to the unique value in the "schemata" column.

	(schemata value = round((hardware value + 1) * 100.0 / (1 << b)).)


In this case, bandwidth_gran would be advertised as 13, which is the
largest step that can be seen in the read-back values.


I would rather avoid promising precisely which values can be read out;
merely that they are consistent with the approximate precision defined
by the bandwidth_gran parameter.

> 
> With resctrl "coercing" the user provided value before providing it to the architecture
> it controls these control steps to match what the documentation states above. If resctrl
> instead provides the value directly to the architecture I see nothing preventing the
> architecture from ignoring resctrl's "contract" with user space documented above and
> using arbitrary control steps since it also controls resctrl_arch_get_config() that is
> displayed directly to user space. What guarantee is there that resctrl_arch_get_config()
> will display a value that is "approximately" min_bw + N * bw_gran? This seems like opening

No guarantee, but there will only be one resctrl_arch_preconvert_bw()
per arch.  We'd expect the functions to be simple, so this doesn't feel
like an excessive maintenance burden?

(All the resctrl_arch_foo() hooks have to do the right thing after all,
otherwise nothing will work.)


With this patch, resctrl_arch_preconvert_bw() and
resctrl_arch_{update_domains,get_config}() between them must provide
the correct behaviour.

But even today, resctrl_arch_update_domains() and
resctrl_arch_get_config() have to do the right thing in order for
bandwidth control values to be handled correctly, as seen through the
schemata interface.

(x86's job is easy, because the generic interface between the resctrl
core and the arch interface happens to be expressed in terms of raw x86
MSR values due to the history.  But other arches don't get the benefit
of that.)


The reason for this patch is the generic code can't do the right thing
for MPAM, unless there is a hook into the arch code, arch-/hardware-
specific knowledge is added in the core code, or unless a misleading
bw_gran value is advertised.

I tried to take the pragmatic approach, but I'm open to suggestions if
you'd prefer this to be factored in another way.

> the door even wider for resctrl to become architecture specific ... with this change the
> schemata file becomes a direct channel between user space and the arch that risks users
> needing to tread carefully when switching between different architectures.

There doesn't feel like a magic solution here.

Exact bandwidth and flow control behaviour is extremely dependent on
hardware topology and the design of interconnects and on dynamic system
load.  An interface that is generic and rigorously defined, yet also
simple enough to be reasonably usable by portable software would
probably not be implementable on any real hardware platform.

So, if it's useful to have a generic interface at all, hardware and
software are going to have to meet in the middle somewhere...

(The historical behaviour -- and even the interface -- of MB is not
generic either, right?)

Cheers
---Dave

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-11-18 15:47   ` Dave Martin
@ 2025-11-20  3:05     ` Reinette Chatre
  2025-11-20 17:04       ` Dave Martin
  0 siblings, 1 reply; 11+ messages in thread
From: Reinette Chatre @ 2025-11-20  3:05 UTC (permalink / raw)
  To: Dave Martin
  Cc: linux-kernel, Tony Luck, James Morse, Ben Horgan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jonathan Corbet, x86, linux-doc

Hi Dave,

On 11/18/25 7:47 AM, Dave Martin wrote:
> Hi Reinette,
> 
> On Fri, Nov 14, 2025 at 02:17:53PM -0800, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/31/25 8:41 AM, Dave Martin wrote:
>>> The control value parser for the MB resource currently coerces the
>>> memory bandwidth percentage value from userspace to be an exact
>>> multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
>>>
>>> On MPAM systems, this results in somewhat worse-than-worst-case
>>> rounding, since the bandwidth granularity advertised to resctrl by the
>>> MPAM driver is in general only an approximation to the actual hardware
>>> granularity on these systems, and the hardware bandwidth allocation
>>> control value is not natively a percentage -- necessitating a further
>>> conversion in the resctrl_arch_update_domains() path, regardless of the
>>> conversion done at parse time.
>>>
>>> Allow the arch to provide its own parse-time conversion that is
>>> appropriate for the hardware, and move the existing conversion to x86.
>>> This will avoid accumulated error from rounding the value twice on MPAM
>>> systems.
>>>
>>> Clarify the documentation, but avoid overly exact promises.
>>>
>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
>>> code, for now.
>>
>> I think they are only theoretically generic since arch sets them and resctrl
>> uses to enforce user input. Arch can thus theoretically set them to whatever
>> the u32 used to represent it allows. Of course, doing something like this makes
>> the interface even harder for users to use.
> 
> Hence, "feels".
> 
> This could perhaps be refined, but I didn't see an obvious reason to
> change the way this works, right now.
> 
> Or, is there a problem here that I'm missing?

No. We are in agreement.

> 
>>
>>>
>>> No functional change.
>>>
>>> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
>>>
>>> ---
>>
>> ...
>>
>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
>>> index b7f35b07876a..fbbcf5421346 100644
>>> --- a/Documentation/filesystems/resctrl.rst
>>> +++ b/Documentation/filesystems/resctrl.rst
>>> @@ -144,12 +144,11 @@ with respect to allocation:
>>>  		user can request.
>>>  
>>>  "bandwidth_gran":
>>> -		The granularity in which the memory bandwidth
>>> -		percentage is allocated. The allocated
>>> -		b/w percentage is rounded off to the next
>>> -		control step available on the hardware. The
>>> -		available bandwidth control steps are:
>>> -		min_bandwidth + N * bandwidth_gran.
>>> +		The approximate granularity in which the memory bandwidth
>>> +		percentage is allocated. The allocated bandwidth percentage
>>> +		is rounded up to the next control step available on the
>>> +		hardware. The available hardware steps are no larger than
>>> +		this value.
>>>  
>>>  "delay_linear":
>>>  		Indicates if the delay scale is linear or
>>> @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
>>>  and can be looked up through "info/MB/min_bandwidth". The bandwidth
>>>  granularity that is allocated is also dependent on the cpu model and can
>>>  be looked up at "info/MB/bandwidth_gran". The available bandwidth
>>> -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
>>> -to the next control step available on the hardware.
>>> +control steps are, approximately, min_bw + N * bw_gran.  The steps may
>>> +appear irregular due to rounding to an exact percentage: bw_gran is the
>>> +maximum interval between the percentage values corresponding to any two
>>> +adjacent steps in the hardware.
>>
>> What can bw_gran be expected to be on Arm systems? Could existing usage be supported with
>> MPAM setting bw_gran to 1?
> 
> Architecturally, ceil(100.0 / (1 << b)), where 1 <= b <= 16.
> 
> So, the possible values are
> 
> 	1, 2, 4, 7, 13, 25, 50
> 
> (with 25 and 50 being the only ones that are exact).
> 
> In practice, something like 6 <= b <= 8 is probably more typical; this
> would give advertised bandwidth_gran values of 2 or 1.
> 
> 
> Re your question about existing usage, were you suggesting
> unconditionally setting bw_gran to 1?

Yes, I *was* suggesting that. I considered bw_gran of 1 to be a "safe" value that
provides architecture with most flexibility. Now that you have provided more
insight on how MPAM uses this value I agree that bw_gran of 1 is not appropriate
for MPAM.

> 
> Since the values are converted to/from hardware representation in
> resctrl_arch_update_domains() / resctrl_arch_get_config(), we
> don't have a problem, provided that the value doesn't get rounded in
> advance by bw_validate().
> 
> But if bw_gran is always 1, it will mislead userspace about the
> precision.  This feels stranger for userspace than fine differences in
> precisely which percentage values get read out of schemata.
> 
>> What will these control steps actually look like when the user views the schemata file
>> on an Arm system?
> 
> It depends on the number of bits in the hardware value (the parameter b
> above).  Picking the smallest, non-trivial value 3:
> 
> 	schemata	hardware (MBW_MAX)
> 
> 	 13		0
> 	 25		1
> 	 38		2
> 	 50		3
> 	 63		4
> 	 75		5
> 	 88		6
> 	100		7
> 
> As currently implemented, I believe that writing the values from the
> "schemata" column above will result in the corresponding values from
> the "hardware" column being written to the hardware.  Achitecturally,
> there is no guaranteed representation for 0%; we would advertise min=13,
> max=100 in info/.)
> 
> A round-to-nearest policy is followed for intermediate values on write.
> 
> 	(hardware value = round(schemata value * 8.0 / 100.0) - 1).)
> 
> Reading the value back translates the value from the "hardware" column
> back to the unique value in the "schemata" column.
> 
> 	(schemata value = round((hardware value + 1) * 100.0 / (1 << b)).)
> 
> 
> In this case, bandwidth_gran would be advertised as 13, which is the
> largest step that can be seen in the read-back values.

Thank you very much for these details. I appreciate having a better understanding
on how the hardware enforces these control steps instead of some software emulation.
With these steps enforced in this way on the architecture side it creates 
confidence that the user space expectations from interface can be met directly
by architecture.

> I would rather avoid promising precisely which values can be read out;
> merely that they are consistent with the approximate precision defined
> by the bandwidth_gran parameter.

ack. I believe the accompanied changes to resctrl.rst supports this.

> 
>>
>> With resctrl "coercing" the user provided value before providing it to the architecture
>> it controls these control steps to match what the documentation states above. If resctrl
>> instead provides the value directly to the architecture I see nothing preventing the
>> architecture from ignoring resctrl's "contract" with user space documented above and
>> using arbitrary control steps since it also controls resctrl_arch_get_config() that is
>> displayed directly to user space. What guarantee is there that resctrl_arch_get_config()
>> will display a value that is "approximately" min_bw + N * bw_gran? This seems like opening
> 
> No guarantee, but there will only be one resctrl_arch_preconvert_bw()
> per arch.  We'd expect the functions to be simple, so this doesn't feel
> like an excessive maintenance burden?

Agree.

> 
> (All the resctrl_arch_foo() hooks have to do the right thing after all,
> otherwise nothing will work.)
> 
> 
> With this patch, resctrl_arch_preconvert_bw() and
> resctrl_arch_{update_domains,get_config}() between them must provide
> the correct behaviour.

Right. This blurs the lines of responsibility. I interpret this as:
"resctrl fs makes promises to user space that resctrl fs *and* the architecture are
 responsible to keep"

> 
> But even today, resctrl_arch_update_domains() and
> resctrl_arch_get_config() have to do the right thing in order for
> bandwidth control values to be handled correctly, as seen through the
> schemata interface.

ack.

> 
> (x86's job is easy, because the generic interface between the resctrl
> core and the arch interface happens to be expressed in terms of raw x86
> MSR values due to the history.  But other arches don't get the benefit
> of that.)

That is just the benefit of being the first architecture to be supported.
If determined to cause headaches elsewhere it can surely change.
> The reason for this patch is the generic code can't do the right thing
> for MPAM, unless there is a hook into the arch code, arch-/hardware-
> specific knowledge is added in the core code, or unless a misleading
> bw_gran value is advertised.
Understood.

> 
> I tried to take the pragmatic approach, but I'm open to suggestions if
> you'd prefer this to be factored in another way.

I am ok with this approach and will respond to the patch details separately.

> 
>> the door even wider for resctrl to become architecture specific ... with this change the
>> schemata file becomes a direct channel between user space and the arch that risks users
>> needing to tread carefully when switching between different architectures.
> 
> There doesn't feel like a magic solution here.
> 
> Exact bandwidth and flow control behaviour is extremely dependent on
> hardware topology and the design of interconnects and on dynamic system
> load.  An interface that is generic and rigorously defined, yet also
> simple enough to be reasonably usable by portable software would
> probably not be implementable on any real hardware platform.
> 
> So, if it's useful to have a generic interface at all, hardware and
> software are going to have to meet in the middle somewhere...

I believe that we could also use above as a quote in support of the schema
description work.

> 
> (The historical behaviour -- and even the interface -- of MB is not
> generic either, right?)

Right. Even so, I prefer to use MB as motivation to do things better rather
than an excuse to make things worse.

Reinette



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-11-10 11:06 ` Ben Horgan
@ 2025-11-20 15:30   ` Dave Martin
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Martin @ 2025-11-20 15:30 UTC (permalink / raw)
  To: Ben Horgan
  Cc: linux-kernel, Tony Luck, Reinette Chatre, James Morse,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jonathan Corbet, x86, linux-doc

Hi Ben,

On Mon, Nov 10, 2025 at 11:06:32AM +0000, Ben Horgan wrote:
> Hi Dave,
> 
> On 10/31/25 15:41, Dave Martin wrote:
> > The control value parser for the MB resource currently coerces the
> > memory bandwidth percentage value from userspace to be an exact
> > multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
> > 
> > On MPAM systems, this results in somewhat worse-than-worst-case
> > rounding, since the bandwidth granularity advertised to resctrl by the
> > MPAM driver is in general only an approximation to the actual hardware
> > granularity on these systems, and the hardware bandwidth allocation
> > control value is not natively a percentage -- necessitating a further
> > conversion in the resctrl_arch_update_domains() path, regardless of the
> > conversion done at parse time.
> > 
> > Allow the arch to provide its own parse-time conversion that is
> > appropriate for the hardware, and move the existing conversion to x86.
> > This will avoid accumulated error from rounding the value twice on MPAM
> > systems.
> > 
> > Clarify the documentation, but avoid overly exact promises.
> > 
> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
> > 
> > No functional change.
> > 
> > Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> > 
> Seems sensible and helpful for MPAM.
> 
> Reviewed-by: Ben Horgan <ben.horgan@arm.com>

Thanks for taking a look.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-10-31 15:41 [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch Dave Martin
  2025-11-10 11:06 ` Ben Horgan
  2025-11-14 22:17 ` Reinette Chatre
@ 2025-11-20 16:46 ` Reinette Chatre
  2025-11-20 17:42   ` Dave Martin
  2 siblings, 1 reply; 11+ messages in thread
From: Reinette Chatre @ 2025-11-20 16:46 UTC (permalink / raw)
  To: Dave Martin, linux-kernel
  Cc: Tony Luck, James Morse, Ben Horgan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jonathan Corbet,
	x86, linux-doc

Hi Dave,

On 10/31/25 8:41 AM, Dave Martin wrote:
> The control value parser for the MB resource currently coerces the
> memory bandwidth percentage value from userspace to be an exact
> multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
> 
> On MPAM systems, this results in somewhat worse-than-worst-case
> rounding, since the bandwidth granularity advertised to resctrl by the
> MPAM driver is in general only an approximation to the actual hardware
> granularity on these systems, and the hardware bandwidth allocation
> control value is not natively a percentage -- necessitating a further
> conversion in the resctrl_arch_update_domains() path, regardless of the
> conversion done at parse time.
> 
> Allow the arch to provide its own parse-time conversion that is
> appropriate for the hardware, and move the existing conversion to x86.
> This will avoid accumulated error from rounding the value twice on MPAM
> systems.
> 
> Clarify the documentation, but avoid overly exact promises.
> 
> Clamping to bw_min and bw_max still feels generic: leave it in the core
> code, for now.
> 
> No functional change.

Please use max line length available. Changelogs have to conform before merge
and having patch ready saves this work.

> 
> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> 
> ---

...

> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index b7f35b07876a..fbbcf5421346 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -144,12 +144,11 @@ with respect to allocation:
>  		user can request.
>  
>  "bandwidth_gran":
> -		The granularity in which the memory bandwidth
> -		percentage is allocated. The allocated
> -		b/w percentage is rounded off to the next
> -		control step available on the hardware. The
> -		available bandwidth control steps are:
> -		min_bandwidth + N * bandwidth_gran.
> +		The approximate granularity in which the memory bandwidth
> +		percentage is allocated. The allocated bandwidth percentage
> +		is rounded up to the next control step available on the
> +		hardware. The available hardware steps are no larger than
> +		this value.
>  
>  "delay_linear":
>  		Indicates if the delay scale is linear or
> @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
>  and can be looked up through "info/MB/min_bandwidth". The bandwidth
>  granularity that is allocated is also dependent on the cpu model and can
>  be looked up at "info/MB/bandwidth_gran". The available bandwidth
> -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> -to the next control step available on the hardware.
> +control steps are, approximately, min_bw + N * bw_gran.  The steps may
> +appear irregular due to rounding to an exact percentage: bw_gran is the
> +maximum interval between the percentage values corresponding to any two
> +adjacent steps in the hardware.
>  
>  The bandwidth throttling is a core specific mechanism on some of Intel
>  SKUs. Using a high bandwidth and a low bandwidth setting on two threads

The documentation changes look good to me. Thank you.

> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1189c0df4ad7..dc05a50e80fb 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -16,9 +16,15 @@
>  #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
>  
>  #include <linux/cpu.h>
> +#include <linux/math.h>
>  
>  #include "internal.h"
>  
> +u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r)
> +{
> +	return roundup(val, (unsigned long)r->membw.bw_gran);
> +}
> +
>  int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
>  			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
>  {
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index 0d0ef54fc4de..26e3f7c88c86 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -35,8 +35,8 @@ typedef int (ctrlval_parser_t)(struct rdt_parse_data *data,
>  /*
>   * Check whether MBA bandwidth percentage value is correct. The value is
>   * checked against the minimum and max bandwidth values specified by the
> - * hardware. The allocated bandwidth percentage is rounded to the next
> - * control step available on the hardware.
> + * hardware. The allocated bandwidth percentage is converted as
> + * appropriate for consumption by the specific hardware driver.

The text looks good but adjusting the right margin mid-paragraph seems unnecessary?

>   */
>  static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
>  {
> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
>  		return false;
>  	}
>  
> -	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
> +	*data = resctrl_arch_preconvert_bw(bw, r);
>  	return true;
>  }
>  
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index a7d92718b653..1fb6e2118b61 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -485,6 +485,20 @@ bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
>   */
>  int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
>  
> +/*
> + * Convert a bandwidth control value to the appropriate form for
> + * consumption by the hardware driver for resource r.

When being as descriptive, please switch to proper kernel-doc instead. There are
a couple of examples for reference in this header file.

> + *
> + * For example, it simplifies the x86 RDT implementation to round the
> + * value to a suitable step here and then treat the resulting value as
> + * opaque when programming the hardware MSRs later on.

This "For example" can be dropped.

> + *
> + * Architectures for which this pre-conversion hook is not useful
> + * should supply an implementation of this function that just returns
> + * val unmodified.
> + */
> +u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r);
> +
>  /*
>   * Update the ctrl_val and apply this config right now.
>   * Must be called on one of the domain's CPUs.
> 
> base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa

Reinette

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-11-20  3:05     ` Reinette Chatre
@ 2025-11-20 17:04       ` Dave Martin
  2025-11-20 21:55         ` Reinette Chatre
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Martin @ 2025-11-20 17:04 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: linux-kernel, Tony Luck, James Morse, Ben Horgan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jonathan Corbet, x86, linux-doc

Hi Reinette,

On Wed, Nov 19, 2025 at 07:05:10PM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 11/18/25 7:47 AM, Dave Martin wrote:
> > Hi Reinette,
> > 
> > On Fri, Nov 14, 2025 at 02:17:53PM -0800, Reinette Chatre wrote:
> >> Hi Dave,
> >>
> >> On 10/31/25 8:41 AM, Dave Martin wrote:
> >>> The control value parser for the MB resource currently coerces the
> >>> memory bandwidth percentage value from userspace to be an exact
> >>> multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
> >>>
> >>> On MPAM systems, this results in somewhat worse-than-worst-case
> >>> rounding, since the bandwidth granularity advertised to resctrl by the
> >>> MPAM driver is in general only an approximation to the actual hardware
> >>> granularity on these systems, and the hardware bandwidth allocation
> >>> control value is not natively a percentage -- necessitating a further
> >>> conversion in the resctrl_arch_update_domains() path, regardless of the
> >>> conversion done at parse time.
> >>>
> >>> Allow the arch to provide its own parse-time conversion that is
> >>> appropriate for the hardware, and move the existing conversion to x86.
> >>> This will avoid accumulated error from rounding the value twice on MPAM
> >>> systems.
> >>>
> >>> Clarify the documentation, but avoid overly exact promises.
> >>>
> >>> Clamping to bw_min and bw_max still feels generic: leave it in the core
> >>> code, for now.
> >>
> >> I think they are only theoretically generic since arch sets them and resctrl
> >> uses to enforce user input. Arch can thus theoretically set them to whatever
> >> the u32 used to represent it allows. Of course, doing something like this makes
> >> the interface even harder for users to use.
> > 
> > Hence, "feels".
> > 
> > This could perhaps be refined, but I didn't see an obvious reason to
> > change the way this works, right now.
> > 
> > Or, is there a problem here that I'm missing?
> 
> No. We are in agreement.

OK, thanks for clarifying.

[...]

> >>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> >>> index b7f35b07876a..fbbcf5421346 100644
> >>> --- a/Documentation/filesystems/resctrl.rst
> >>> +++ b/Documentation/filesystems/resctrl.rst
> >>> @@ -144,12 +144,11 @@ with respect to allocation:
> >>>  		user can request.
> >>>  
> >>>  "bandwidth_gran":
> >>> -		The granularity in which the memory bandwidth
> >>> -		percentage is allocated. The allocated
> >>> -		b/w percentage is rounded off to the next
> >>> -		control step available on the hardware. The
> >>> -		available bandwidth control steps are:
> >>> -		min_bandwidth + N * bandwidth_gran.
> >>> +		The approximate granularity in which the memory bandwidth
> >>> +		percentage is allocated. The allocated bandwidth percentage
> >>> +		is rounded up to the next control step available on the
> >>> +		hardware. The available hardware steps are no larger than
> >>> +		this value.
> >>>  
> >>>  "delay_linear":
> >>>  		Indicates if the delay scale is linear or
> >>> @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
> >>>  and can be looked up through "info/MB/min_bandwidth". The bandwidth
> >>>  granularity that is allocated is also dependent on the cpu model and can
> >>>  be looked up at "info/MB/bandwidth_gran". The available bandwidth
> >>> -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> >>> -to the next control step available on the hardware.
> >>> +control steps are, approximately, min_bw + N * bw_gran.  The steps may
> >>> +appear irregular due to rounding to an exact percentage: bw_gran is the
> >>> +maximum interval between the percentage values corresponding to any two
> >>> +adjacent steps in the hardware.
> >>
> >> What can bw_gran be expected to be on Arm systems? Could existing usage be supported with
> >> MPAM setting bw_gran to 1?
> > 
> > Architecturally, ceil(100.0 / (1 << b)), where 1 <= b <= 16.
> > 
> > So, the possible values are
> > 
> > 	1, 2, 4, 7, 13, 25, 50
> > 
> > (with 25 and 50 being the only ones that are exact).
> > 
> > In practice, something like 6 <= b <= 8 is probably more typical; this
> > would give advertised bandwidth_gran values of 2 or 1.
> > 
> > 
> > Re your question about existing usage, were you suggesting
> > unconditionally setting bw_gran to 1?
> 
> Yes, I *was* suggesting that. I considered bw_gran of 1 to be a "safe" value that
> provides architecture with most flexibility. Now that you have provided more
> insight on how MPAM uses this value I agree that bw_gran of 1 is not appropriate
> for MPAM.

Right.

> > Since the values are converted to/from hardware representation in
> > resctrl_arch_update_domains() / resctrl_arch_get_config(), we
> > don't have a problem, provided that the value doesn't get rounded in
> > advance by bw_validate().
> > 
> > But if bw_gran is always 1, it will mislead userspace about the
> > precision.  This feels stranger for userspace than fine differences in
> > precisely which percentage values get read out of schemata.
> > 
> >> What will these control steps actually look like when the user views the schemata file
> >> on an Arm system?
> > 
> > It depends on the number of bits in the hardware value (the parameter b
> > above).  Picking the smallest, non-trivial value 3:
> > 
> > 	schemata	hardware (MBW_MAX)
> > 
> > 	 13		0
> > 	 25		1
> > 	 38		2
> > 	 50		3
> > 	 63		4
> > 	 75		5
> > 	 88		6
> > 	100		7
> > 
> > As currently implemented, I believe that writing the values from the
> > "schemata" column above will result in the corresponding values from
> > the "hardware" column being written to the hardware.  Achitecturally,
> > there is no guaranteed representation for 0%; we would advertise min=13,
> > max=100 in info/.)
> > 
> > A round-to-nearest policy is followed for intermediate values on write.
> > 
> > 	(hardware value = round(schemata value * 8.0 / 100.0) - 1).)
> > 
> > Reading the value back translates the value from the "hardware" column
> > back to the unique value in the "schemata" column.
> > 
> > 	(schemata value = round((hardware value + 1) * 100.0 / (1 << b)).)
> > 
> > 
> > In this case, bandwidth_gran would be advertised as 13, which is the
> > largest step that can be seen in the read-back values.
> 
> Thank you very much for these details. I appreciate having a better understanding
> on how the hardware enforces these control steps instead of some software emulation.
> With these steps enforced in this way on the architecture side it creates 
> confidence that the user space expectations from interface can be met directly
> by architecture.
> 
> > I would rather avoid promising precisely which values can be read out;
> > merely that they are consistent with the approximate precision defined
> > by the bandwidth_gran parameter.
> 
> ack. I believe the accompanied changes to resctrl.rst supports this.

OK, good.  Getting the rounding behaviour right so that the MPAM
conversion is stable turned out to be a subtle business, so it's
reassuring that if we find it isn't implemented quite right and need to
fine-tune it later, we don't consider that an interface break (within
reason).

> >> With resctrl "coercing" the user provided value before providing it to the architecture
> >> it controls these control steps to match what the documentation states above. If resctrl
> >> instead provides the value directly to the architecture I see nothing preventing the
> >> architecture from ignoring resctrl's "contract" with user space documented above and
> >> using arbitrary control steps since it also controls resctrl_arch_get_config() that is
> >> displayed directly to user space. What guarantee is there that resctrl_arch_get_config()
> >> will display a value that is "approximately" min_bw + N * bw_gran? This seems like opening
> > 
> > No guarantee, but there will only be one resctrl_arch_preconvert_bw()
> > per arch.  We'd expect the functions to be simple, so this doesn't feel
> > like an excessive maintenance burden?
> 
> Agree.
> 
> > 
> > (All the resctrl_arch_foo() hooks have to do the right thing after all,
> > otherwise nothing will work.)
> > 
> > 
> > With this patch, resctrl_arch_preconvert_bw() and
> > resctrl_arch_{update_domains,get_config}() between them must provide
> > the correct behaviour.
> 
> Right. This blurs the lines of responsibility. I interpret this as:
> "resctrl fs makes promises to user space that resctrl fs *and* the architecture are
>  responsible to keep"

Ack.  It's a joint responsbility.

Thinking about it, I don't think that the value returned by
resctrl_arch_preconvert_bw() is used in any way except for storing it
into rdt_ctrl_domain::staged_configs[] for later consumption by
resctrl_arch_update_domains().

If so, the meaning of this value is really arch-determined.

It is convenient for the x86 implementation to convert this to hardware
form at parse time, whereas, because of the way the MPAM code is
structured, it suits the MPAM driver better to convert it later on.

In the x86 case, it means that the arch code doesn't have to
distinguish much between input values for different kinds of control:
it's just a matter of writing precomputed values to MSRs.  This keeps
parts of the backend simple.

In the MPAM case, it's more about encapsulating grungy arch-specific
details in the driver.  The common resctrl structures don't contain all
the information needed.  We might have to pull a lot of junk out of the
private headers in order to the conversion in the context of an inline
function called by the schemata parser.

I think that it's likely that either all the conversion work is done in
resctrl_arch_preconvert_bw() (e.g., x86), or all the work is done in
the resctrl_arch_update_domains() backend (e.g., MPAM).  There's
probably no good reason ever to split the work into two parts.

Does that make sense?

> > 
> > But even today, resctrl_arch_update_domains() and
> > resctrl_arch_get_config() have to do the right thing in order for
> > bandwidth control values to be handled correctly, as seen through the
> > schemata interface.
> 
> ack.
> 
> > 
> > (x86's job is easy, because the generic interface between the resctrl
> > core and the arch interface happens to be expressed in terms of raw x86
> > MSR values due to the history.  But other arches don't get the benefit
> > of that.)
> 
> That is just the benefit of being the first architecture to be supported.
> If determined to cause headaches elsewhere it can surely change.
> > The reason for this patch is the generic code can't do the right thing
> > for MPAM, unless there is a hook into the arch code, arch-/hardware-
> > specific knowledge is added in the core code, or unless a misleading
> > bw_gran value is advertised.
> Understood.
> 
> > 
> > I tried to take the pragmatic approach, but I'm open to suggestions if
> > you'd prefer this to be factored in another way.
> 
> I am ok with this approach and will respond to the patch details separately.

OK, thanks -- I see you already replied, so I'll respond there.

> > 
> >> the door even wider for resctrl to become architecture specific ... with this change the
> >> schemata file becomes a direct channel between user space and the arch that risks users
> >> needing to tread carefully when switching between different architectures.
> > 
> > There doesn't feel like a magic solution here.
> > 
> > Exact bandwidth and flow control behaviour is extremely dependent on
> > hardware topology and the design of interconnects and on dynamic system
> > load.  An interface that is generic and rigorously defined, yet also
> > simple enough to be reasonably usable by portable software would
> > probably not be implementable on any real hardware platform.
> > 
> > So, if it's useful to have a generic interface at all, hardware and
> > software are going to have to meet in the middle somewhere...
> 
> I believe that we could also use above as a quote in support of the schema
> description work.
> 
> > 
> > (The historical behaviour -- and even the interface -- of MB is not
> > generic either, right?)
> 
> Right. Even so, I prefer to use MB as motivation to do things better rather
> than an excuse to make things worse.
> 
> Reinette

Cheap shot, I know.

I guess that more is known now about what kinds of control behaviour
are likely to exist than was known about when resctrl was first
developed... though we still don't have a crystal ball.

My aim is generalise enough to cover most of what we know about today,
while not inventing pie-in-the-sky abstractions that may never be
fully used...  It's a balancing act, though.

There's a fine like between a "random nonportable junk" schema and
reasonable exposure of an architecture-specific control that makes
sense within resctrl but is unlikely to map onto other architectures.

We should certainly push back on the latter, but could it be appropriate
to expose arch-specific control types in some situations?  I don't like
to rule it out absolutely.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-11-20 16:46 ` Reinette Chatre
@ 2025-11-20 17:42   ` Dave Martin
  2025-11-20 22:01     ` Reinette Chatre
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Martin @ 2025-11-20 17:42 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: linux-kernel, Tony Luck, James Morse, Ben Horgan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jonathan Corbet, x86, linux-doc

On Thu, Nov 20, 2025 at 08:46:24AM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 10/31/25 8:41 AM, Dave Martin wrote:
> > The control value parser for the MB resource currently coerces the
> > memory bandwidth percentage value from userspace to be an exact
> > multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
> > 
> > On MPAM systems, this results in somewhat worse-than-worst-case
> > rounding, since the bandwidth granularity advertised to resctrl by the
> > MPAM driver is in general only an approximation to the actual hardware
> > granularity on these systems, and the hardware bandwidth allocation
> > control value is not natively a percentage -- necessitating a further
> > conversion in the resctrl_arch_update_domains() path, regardless of the
> > conversion done at parse time.
> > 
> > Allow the arch to provide its own parse-time conversion that is
> > appropriate for the hardware, and move the existing conversion to x86.
> > This will avoid accumulated error from rounding the value twice on MPAM
> > systems.
> > 
> > Clarify the documentation, but avoid overly exact promises.
> > 
> > Clamping to bw_min and bw_max still feels generic: leave it in the core
> > code, for now.
> > 
> > No functional change.
> 
> Please use max line length available. Changelogs have to conform before merge
> and having patch ready saves this work.

Fixed.  Sorry, old habits die hard.

I won't bother with the tearoff text, though, it that's OK -- that
won't go into git.

[...]

> > diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> > index b7f35b07876a..fbbcf5421346 100644
> > --- a/Documentation/filesystems/resctrl.rst
> > +++ b/Documentation/filesystems/resctrl.rst

[...]

> > @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
> >  and can be looked up through "info/MB/min_bandwidth". The bandwidth
> >  granularity that is allocated is also dependent on the cpu model and can
> >  be looked up at "info/MB/bandwidth_gran". The available bandwidth
> > -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
> > -to the next control step available on the hardware.
> > +control steps are, approximately, min_bw + N * bw_gran.  The steps may
> > +appear irregular due to rounding to an exact percentage: bw_gran is the
> > +maximum interval between the percentage values corresponding to any two
> > +adjacent steps in the hardware.
> >  
> >  The bandwidth throttling is a core specific mechanism on some of Intel
> >  SKUs. Using a high bandwidth and a low bandwidth setting on two threads
> 
> The documentation changes look good to me. Thank you.

OK, sounds good.

[...]

> > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > index 0d0ef54fc4de..26e3f7c88c86 100644
> > --- a/fs/resctrl/ctrlmondata.c
> > +++ b/fs/resctrl/ctrlmondata.c
> > @@ -35,8 +35,8 @@ typedef int (ctrlval_parser_t)(struct rdt_parse_data *data,
> >  /*
> >   * Check whether MBA bandwidth percentage value is correct. The value is
> >   * checked against the minimum and max bandwidth values specified by the
> > - * hardware. The allocated bandwidth percentage is rounded to the next
> > - * control step available on the hardware.
> > + * hardware. The allocated bandwidth percentage is converted as
> > + * appropriate for consumption by the specific hardware driver.
> 
> The text looks good but adjusting the right margin mid-paragraph seems unnecessary?
> 

I was trying to follow the prevailing line length; but I guess the
lines existing lines were already shortened by flowing the text, so I
misjudged it.

Fixed locally.

Is there a view on how to flow new text? (not so applicable here)

Keeping the lines a bit short (say, 72 chars) means that minor edits
and typo fixes can be applied without extraneous diff noise most of the
time.  I find reviewing an entire re-flowed paragraph annoying when
there is really just a one-word change buried in the middle of it
somewhere.

Equally, I try to avoid reflowing text for minor edits unless
absolutely necessary; poking a few chars over 80 seems tolerable in
that situation, but I prefer it not to go too far...

> >   */
> >  static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> >  {
> > @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
> >  		return false;
> >  	}
> >  
> > -	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
> > +	*data = resctrl_arch_preconvert_bw(bw, r);
> >  	return true;
> >  }
> >  
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index a7d92718b653..1fb6e2118b61 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -485,6 +485,20 @@ bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
> >   */
> >  int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
> >  
> > +/*
> > + * Convert a bandwidth control value to the appropriate form for
> > + * consumption by the hardware driver for resource r.
> 
> When being as descriptive, please switch to proper kernel-doc instead. There are
> a couple of examples for reference in this header file.

This comment was pretty trivial to begin with, but grew.

> > + *
> > + * For example, it simplifies the x86 RDT implementation to round the
> > + * value to a suitable step here and then treat the resulting value as
> > + * opaque when programming the hardware MSRs later on.
> 
> This "For example" can be dropped.

Done.

> > + *
> > + * Architectures for which this pre-conversion hook is not useful
> > + * should supply an implementation of this function that just returns
> > + * val unmodified.
> > + */

I've fleshed this out a little, as follows:

--8<--

/**
 * resctrl_arch_preconvert_bw() - Convert a bandwidth control value to the
 *				  appropriate form for consumption by the
 *				  hardware driver for resource r.
 * @val:	Control value written to the schemata file by userspace.
 * @r:		Resource whose schema was written.
 *
 * Return:	The converted value.
 *
 * It simplifies the x86 RDT implementation to round the value to a suitable
 * step at parse time and then treat the resulting value as opaque when
 * programming the hardware MSRs later on.  In this situation, this hook is the
 * correct place to perform the conversion.
 *
 * Architectures for which this pre-conversion hook is not useful should supply
 * an implementation of this function that just returns @val unmodified.
 */
u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r);

-->8--

Does that look like enough?

Cheers
---Dave

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-11-20 17:04       ` Dave Martin
@ 2025-11-20 21:55         ` Reinette Chatre
  0 siblings, 0 replies; 11+ messages in thread
From: Reinette Chatre @ 2025-11-20 21:55 UTC (permalink / raw)
  To: Dave Martin
  Cc: linux-kernel, Tony Luck, James Morse, Ben Horgan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jonathan Corbet, x86, linux-doc

Hi Dave,

On 11/20/25 9:04 AM, Dave Martin wrote:
> On Wed, Nov 19, 2025 at 07:05:10PM -0800, Reinette Chatre wrote:
>> On 11/18/25 7:47 AM, Dave Martin wrote:
>>> On Fri, Nov 14, 2025 at 02:17:53PM -0800, Reinette Chatre wrote:
>>>> On 10/31/25 8:41 AM, Dave Martin wrote:


 
>>>> With resctrl "coercing" the user provided value before providing it to the architecture
>>>> it controls these control steps to match what the documentation states above. If resctrl
>>>> instead provides the value directly to the architecture I see nothing preventing the
>>>> architecture from ignoring resctrl's "contract" with user space documented above and
>>>> using arbitrary control steps since it also controls resctrl_arch_get_config() that is
>>>> displayed directly to user space. What guarantee is there that resctrl_arch_get_config()
>>>> will display a value that is "approximately" min_bw + N * bw_gran? This seems like opening
>>>
>>> No guarantee, but there will only be one resctrl_arch_preconvert_bw()
>>> per arch.  We'd expect the functions to be simple, so this doesn't feel
>>> like an excessive maintenance burden?
>>
>> Agree.
>>
>>>
>>> (All the resctrl_arch_foo() hooks have to do the right thing after all,
>>> otherwise nothing will work.)
>>>
>>>
>>> With this patch, resctrl_arch_preconvert_bw() and
>>> resctrl_arch_{update_domains,get_config}() between them must provide
>>> the correct behaviour.
>>
>> Right. This blurs the lines of responsibility. I interpret this as:
>> "resctrl fs makes promises to user space that resctrl fs *and* the architecture are
>>  responsible to keep"
> 
> Ack.  It's a joint responsbility.
> 
> Thinking about it, I don't think that the value returned by
> resctrl_arch_preconvert_bw() is used in any way except for storing it
> into rdt_ctrl_domain::staged_configs[] for later consumption by
> resctrl_arch_update_domains().
> 
> If so, the meaning of this value is really arch-determined.
> 
> It is convenient for the x86 implementation to convert this to hardware
> form at parse time, whereas, because of the way the MPAM code is
> structured, it suits the MPAM driver better to convert it later on.
> 
> In the x86 case, it means that the arch code doesn't have to
> distinguish much between input values for different kinds of control:
> it's just a matter of writing precomputed values to MSRs.  This keeps
> parts of the backend simple.
> 
> In the MPAM case, it's more about encapsulating grungy arch-specific
> details in the driver.  The common resctrl structures don't contain all
> the information needed.  We might have to pull a lot of junk out of the
> private headers in order to the conversion in the context of an inline
> function called by the schemata parser.
> 
> I think that it's likely that either all the conversion work is done in
> resctrl_arch_preconvert_bw() (e.g., x86), or all the work is done in
> the resctrl_arch_update_domains() backend (e.g., MPAM).  There's
> probably no good reason ever to split the work into two parts.
> 
> Does that make sense?

Good point. Yes.

> 
>>>
>>> But even today, resctrl_arch_update_domains() and
>>> resctrl_arch_get_config() have to do the right thing in order for
>>> bandwidth control values to be handled correctly, as seen through the
>>> schemata interface.
>>
>> ack.
>>
>>>
>>> (x86's job is easy, because the generic interface between the resctrl
>>> core and the arch interface happens to be expressed in terms of raw x86
>>> MSR values due to the history.  But other arches don't get the benefit
>>> of that.)
>>
>> That is just the benefit of being the first architecture to be supported.
>> If determined to cause headaches elsewhere it can surely change.
>>> The reason for this patch is the generic code can't do the right thing
>>> for MPAM, unless there is a hook into the arch code, arch-/hardware-
>>> specific knowledge is added in the core code, or unless a misleading
>>> bw_gran value is advertised.
>> Understood.
>>
>>>
>>> I tried to take the pragmatic approach, but I'm open to suggestions if
>>> you'd prefer this to be factored in another way.
>>
>> I am ok with this approach and will respond to the patch details separately.
> 
> OK, thanks -- I see you already replied, so I'll respond there.
> 
>>>
>>>> the door even wider for resctrl to become architecture specific ... with this change the
>>>> schemata file becomes a direct channel between user space and the arch that risks users
>>>> needing to tread carefully when switching between different architectures.
>>>
>>> There doesn't feel like a magic solution here.
>>>
>>> Exact bandwidth and flow control behaviour is extremely dependent on
>>> hardware topology and the design of interconnects and on dynamic system
>>> load.  An interface that is generic and rigorously defined, yet also
>>> simple enough to be reasonably usable by portable software would
>>> probably not be implementable on any real hardware platform.
>>>
>>> So, if it's useful to have a generic interface at all, hardware and
>>> software are going to have to meet in the middle somewhere...
>>
>> I believe that we could also use above as a quote in support of the schema
>> description work.
>>
>>>
>>> (The historical behaviour -- and even the interface -- of MB is not
>>> generic either, right?)
>>
>> Right. Even so, I prefer to use MB as motivation to do things better rather
>> than an excuse to make things worse.
>>
>> Reinette
> 
> Cheap shot, I know.
> 
> I guess that more is known now about what kinds of control behaviour
> are likely to exist than was known about when resctrl was first
> developed... though we still don't have a crystal ball.
> 
> My aim is generalise enough to cover most of what we know about today,
> while not inventing pie-in-the-sky abstractions that may never be
> fully used...  It's a balancing act, though.
> 
> There's a fine like between a "random nonportable junk" schema and
> reasonable exposure of an architecture-specific control that makes
> sense within resctrl but is unlikely to map onto other architectures.
> 
> We should certainly push back on the latter, but could it be appropriate
> to expose arch-specific control types in some situations?  I don't like
> to rule it out absolutely.

It is difficult to fully reason about merits without an example as reference.
We'll have better information when faced with enabling of such a control.
Some initial thoughts are: First, while a control may start as unique to an arch
we cannot predict that such a control will remain so. Second, with the new
schema descriptions it should be possible to easily add support for a
new control type. Of course, the first arch that supports the control type
has benefit of establishing the needed parameters. It would be ideal if a new
control could be mapped so that it appears all architectures support it but
I do not think resctrl should make that a requirement for support of a new
control.

Reinette




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch
  2025-11-20 17:42   ` Dave Martin
@ 2025-11-20 22:01     ` Reinette Chatre
  0 siblings, 0 replies; 11+ messages in thread
From: Reinette Chatre @ 2025-11-20 22:01 UTC (permalink / raw)
  To: Dave Martin
  Cc: linux-kernel, Tony Luck, James Morse, Ben Horgan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jonathan Corbet, x86, linux-doc

Hi Dave,

On 11/20/25 9:42 AM, Dave Martin wrote:
> On Thu, Nov 20, 2025 at 08:46:24AM -0800, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 10/31/25 8:41 AM, Dave Martin wrote:
>>> The control value parser for the MB resource currently coerces the
>>> memory bandwidth percentage value from userspace to be an exact
>>> multiple of the rdt_resource::resctrl_membw::bw_gran parameter.
>>>
>>> On MPAM systems, this results in somewhat worse-than-worst-case
>>> rounding, since the bandwidth granularity advertised to resctrl by the
>>> MPAM driver is in general only an approximation to the actual hardware
>>> granularity on these systems, and the hardware bandwidth allocation
>>> control value is not natively a percentage -- necessitating a further
>>> conversion in the resctrl_arch_update_domains() path, regardless of the
>>> conversion done at parse time.
>>>
>>> Allow the arch to provide its own parse-time conversion that is
>>> appropriate for the hardware, and move the existing conversion to x86.
>>> This will avoid accumulated error from rounding the value twice on MPAM
>>> systems.
>>>
>>> Clarify the documentation, but avoid overly exact promises.
>>>
>>> Clamping to bw_min and bw_max still feels generic: leave it in the core
>>> code, for now.
>>>
>>> No functional change.
>>
>> Please use max line length available. Changelogs have to conform before merge
>> and having patch ready saves this work.
> 
> Fixed.  Sorry, old habits die hard.
> 
> I won't bother with the tearoff text, though, it that's OK -- that
> won't go into git.

ok with me. I am not familiar with precedent in this regard to predict x86 maintainers'
preference. 

> 
> [...]
> 
>>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
>>> index b7f35b07876a..fbbcf5421346 100644
>>> --- a/Documentation/filesystems/resctrl.rst
>>> +++ b/Documentation/filesystems/resctrl.rst
> 
> [...]
> 
>>> @@ -737,8 +736,10 @@ The minimum bandwidth percentage value for each cpu model is predefined
>>>  and can be looked up through "info/MB/min_bandwidth". The bandwidth
>>>  granularity that is allocated is also dependent on the cpu model and can
>>>  be looked up at "info/MB/bandwidth_gran". The available bandwidth
>>> -control steps are: min_bw + N * bw_gran. Intermediate values are rounded
>>> -to the next control step available on the hardware.
>>> +control steps are, approximately, min_bw + N * bw_gran.  The steps may
>>> +appear irregular due to rounding to an exact percentage: bw_gran is the
>>> +maximum interval between the percentage values corresponding to any two
>>> +adjacent steps in the hardware.
>>>  
>>>  The bandwidth throttling is a core specific mechanism on some of Intel
>>>  SKUs. Using a high bandwidth and a low bandwidth setting on two threads
>>
>> The documentation changes look good to me. Thank you.
> 
> OK, sounds good.
> 
> [...]
> 
>>> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
>>> index 0d0ef54fc4de..26e3f7c88c86 100644
>>> --- a/fs/resctrl/ctrlmondata.c
>>> +++ b/fs/resctrl/ctrlmondata.c
>>> @@ -35,8 +35,8 @@ typedef int (ctrlval_parser_t)(struct rdt_parse_data *data,
>>>  /*
>>>   * Check whether MBA bandwidth percentage value is correct. The value is
>>>   * checked against the minimum and max bandwidth values specified by the
>>> - * hardware. The allocated bandwidth percentage is rounded to the next
>>> - * control step available on the hardware.
>>> + * hardware. The allocated bandwidth percentage is converted as
>>> + * appropriate for consumption by the specific hardware driver.
>>
>> The text looks good but adjusting the right margin mid-paragraph seems unnecessary?
>>
> 
> I was trying to follow the prevailing line length; but I guess the
> lines existing lines were already shortened by flowing the text, so I
> misjudged it.
> 
> Fixed locally.
> 
> Is there a view on how to flow new text? (not so applicable here)
> 
> Keeping the lines a bit short (say, 72 chars) means that minor edits
> and typo fixes can be applied without extraneous diff noise most of the
> time.  I find reviewing an entire re-flowed paragraph annoying when
> there is really just a one-word change buried in the middle of it
> somewhere.
> 
> Equally, I try to avoid reflowing text for minor edits unless
> absolutely necessary; poking a few chars over 80 seems tolerable in
> that situation, but I prefer it not to go too far...

80 is not a strict rule. Latest comment from Boris in this regard can be
found in link below (made after he went through and reformatted most changelogs
in that resctrl series):
https://lore.kernel.org/lkml/20250916105447.GCaMlB976WLxHHeNMD@fat_crate.local/

> 
>>>   */
>>>  static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
>>>  {
>>> @@ -69,7 +69,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
>>>  		return false;
>>>  	}
>>>  
>>> -	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
>>> +	*data = resctrl_arch_preconvert_bw(bw, r);
>>>  	return true;
>>>  }
>>>  
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index a7d92718b653..1fb6e2118b61 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -485,6 +485,20 @@ bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r);
>>>   */
>>>  int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable);
>>>  
>>> +/*
>>> + * Convert a bandwidth control value to the appropriate form for
>>> + * consumption by the hardware driver for resource r.
>>
>> When being as descriptive, please switch to proper kernel-doc instead. There are
>> a couple of examples for reference in this header file.
> 
> This comment was pretty trivial to begin with, but grew.
> 
>>> + *
>>> + * For example, it simplifies the x86 RDT implementation to round the
>>> + * value to a suitable step here and then treat the resulting value as
>>> + * opaque when programming the hardware MSRs later on.
>>
>> This "For example" can be dropped.
> 
> Done.
> 
>>> + *
>>> + * Architectures for which this pre-conversion hook is not useful
>>> + * should supply an implementation of this function that just returns
>>> + * val unmodified.
>>> + */
> 
> I've fleshed this out a little, as follows:
> 
> --8<--
> 
> /**
>  * resctrl_arch_preconvert_bw() - Convert a bandwidth control value to the
>  *				  appropriate form for consumption by the
>  *				  hardware driver for resource r.

This should be a short description. For example: "Prepare bandwidth value for arch use"
Longer description follows the argument descriptions.

>  * @val:	Control value written to the schemata file by userspace.

This can be more specific with: "Bandwidth control value ..."

>  * @r:		Resource whose schema was written.

This is where longer description is expected. For more details you can refer to
Documentation/doc-guide/kernel-doc.rst

>  *
>  * Return:	The converted value.
>  *
>  * It simplifies the x86 RDT implementation to round the value to a suitable
>  * step at parse time and then treat the resulting value as opaque when
>  * programming the hardware MSRs later on.  In this situation, this hook is the
>  * correct place to perform the conversion.

My apologies for not being specific. I meant that the entire x86 specific "For example"
paragraph can be dropped. Unless there are some quirks the architecture should watch out for
(which should be avoided) the focus should just be on what is expected from
architecture and/or insight to architecture how resctrl interacts with the call.
For example (using some of your text from other thread),

   Convert the user provided bandwidth control value to an appropriate form for
   consumption by the hardware driver for resource @r. Converted value is stored in
   rdt_ctrl_domain::staged_config[] for later consumption by resctrl_arch_update_domains().
   Is not called when MBA software controller is enabled.

>  *
>  * Architectures for which this pre-conversion hook is not useful should supply
>  * an implementation of this function that just returns @val unmodified.
>  */
> u32 resctrl_arch_preconvert_bw(u32 val, const struct rdt_resource *r);
> 
> -->8--
> 

Reinette


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-11-20 22:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-31 15:41 [PATCH v2] x86,fs/resctrl: Factor MBA parse-time conversion to be per-arch Dave Martin
2025-11-10 11:06 ` Ben Horgan
2025-11-20 15:30   ` Dave Martin
2025-11-14 22:17 ` Reinette Chatre
2025-11-18 15:47   ` Dave Martin
2025-11-20  3:05     ` Reinette Chatre
2025-11-20 17:04       ` Dave Martin
2025-11-20 21:55         ` Reinette Chatre
2025-11-20 16:46 ` Reinette Chatre
2025-11-20 17:42   ` Dave Martin
2025-11-20 22:01     ` Reinette Chatre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).