* [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Privilege Level Zero Association
@ 2026-01-21 21:12 Babu Moger
2026-01-21 21:12 ` [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE) Babu Moger
` (19 more replies)
0 siblings, 20 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
This patch series adds support for Global Bandwidth Enforcement (GLBE),
Global Slow Bandwidth Enforcement (GLSBE) and Privilege Level Zero
Association (PLZA) in the x86 architecture's fs/resctrl subsystem. The
changes include modifications to the resctrl filesystem to allow users to
configure GLBE settings and associate CPUs and tasks with a separate CLOS
when executing in CPL0.
The feature documentation is not yet publicly available, but it is expected
to be released in the next few weeks. In the meantime, a brief description
of the features is provided below. Sharing this series as an RFC to gather
initial feedback. Comments are welcome.
Global Bandwidth Enforcement (GLBE)
AMD Global Bandwidth Enforcement (GLBE) provides a mechanism for software
to specify bandwidth limits for groups of threads that span multiple QoS
Domains. This collection of QoS Domains is referred to as the GLBE Control
Domain. The GLBE ceiling is a bandwidth ceiling for L3 External Bandwidth
competitively shared between all threads in a COS (Class of Service) across
all QoS Domains within the GLBE Control Domain. This complements L3
External Bandwidth Enforcement (L3BE), which provides L3 External Bandwidth
control at per-QoS-Domain granularity.
Global Slow Bandwidth Enforcement (GLSBE)
AMD PQoS Global Slow Bandwidth Enforcement (GLSBE) provides a mechanism for
software to specify bandwidth limits for groups of threads that span
multiple QoS Domains. GLSBE operates within the same GLBE Control Domains
defined by GLBE. The GLSBE ceiling is a bandwidth ceiling for L3 External
Bandwidth to Slow Memory competitively shared between all threads in a COS
across all QoS Domains within the GLBE Control Domain. This complements
L3SMBE, which provides Slow Memory bandwidth control at per-QoS-Domain
granularity.
Privilege Level Zero Association (PLZA)
Privilege Level Zero Association (PLZA) allows the hardware to
automatically associate execution in Privilege Level Zero (CPL=0) with a
specific COS (Class of Service) and/or RMID (Resource Monitoring
Identifier). The QoS feature set already has a mechanism to associate
execution on each logical processor with an RMID or COS. PLZA allows the
system to override this per-thread association for a thread that is
executing with CPL=0.
The patches are based on top of:
Commit 0f61b1860cc3 (tag: v6.19-rc5, tip/tip/urgent) Linux 6.19-rc5
Changes include:
- Introduction of a new max_bandwidth file for each resctrl resource to
expose the maximum supported bandwidth.
- Addition of new schemata interfaces, GMB and GSMBA, for configuring GLBE
and GLSBE parameters.
- Modifications to associate resctrl groups with PLZA.
- Documentation updates to describe the new functionality.
Interface Changes:
1. A new max_bandwidth file has been added under each resource type
directory (for example, /sys/fs/resctrl/info/GMB/max_bandwidth) to
report the maximum bandwidth supported by the resource.
2. New resource types, GMB and GSMBA, have been introduced and are exposed
through the schemata interface:
# cat /sys/fs/resctrl/schemata
GSMBA:0=4096;1=4096
SMBA:0=8192;1=8192
GMB:0=4096;1=4096
MB:0=8192;1=8192
L3:0=ffff;1=ffff
3. A new plza_capable file has been added under each resource type directory
(for example, /sys/fs/resctrl/info/GMB/plza_capable) to indicate whether
the resource supports the PLZA feature.
4. A new plza control file has been added to each resctrl group (for example,
/sys/fs/resctrl/plza) to enable or disable PLZA association for the group.
Writing 1 enables PLZA for the group, while writing 0 disables it.
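A rough sketch of how the interface files above would be exercised. This is illustrative only: it mimics the layout in a scratch directory so the commands can be followed anywhere, whereas on real hardware the info files are read-only and created by the kernel under a mounted resctrl filesystem.

```shell
# Illustrative only: mimic the proposed layout in a scratch directory.
RESCTRL=$(mktemp -d)
mkdir -p "$RESCTRL/info/GMB" "$RESCTRL/group1"

# On a real system the kernel populates these read-only info files;
# the values here are made up.
echo 4096 > "$RESCTRL/info/GMB/max_bandwidth"
echo 1 > "$RESCTRL/info/GMB/plza_capable"

# Enable PLZA for a control group (item 4 above): write 1 to plza.
echo 1 > "$RESCTRL/group1/plza"

cat "$RESCTRL/info/GMB/max_bandwidth" "$RESCTRL/group1/plza"
```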
Babu Moger (19):
x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
x86,fs/resctrl: Add the resource for Global Memory Bandwidth
Allocation
fs/resctrl: Add new interface max_bandwidth
fs/resctrl: Add the documentation for Global Memory Bandwidth
Allocation
x86,fs/resctrl: Add support for Global Slow Memory Bandwidth
Allocation (GSMBA)
x86,fs/resctrl: Add the resource for Global Slow Memory Bandwidth
Enforcement (GLSBE)
fs/resctrl: Add the documentation for Global Slow Memory Bandwidth
Allocation
x86/resctrl: Support Privilege-Level Zero Association (PLZA)
x86/resctrl: Add plza_capable in rdt_resource data structure
fs/resctrl: Expose plza_capable via control info file
resctrl: Introduce PLZA static key enable/disable helpers
x86/resctrl: Add data structures and definitions for PLZA
configuration
x86/resctrl: Add PLZA state tracking and context switch handling
x86,fs/resctrl: Add the functionality to configure PLZA
fs/resctrl: Introduce PLZA attribute in rdtgroup interface
fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a
group
fs/resctrl: Update PLZA configuration when cpu_mask changes
x86/resctrl: Refactor show_rdt_tasks() to support PLZA task matching
fs/resctrl: Add per-task PLZA enable support via rdtgroup
.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/filesystems/resctrl.rst | 110 ++++++-
arch/x86/include/asm/cpufeatures.h | 4 +-
arch/x86/include/asm/msr-index.h | 9 +
arch/x86/include/asm/resctrl.h | 44 +++
arch/x86/kernel/cpu/resctrl/core.c | 89 +++++-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 25 ++
arch/x86/kernel/cpu/resctrl/internal.h | 26 ++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 +
arch/x86/kernel/cpu/scattered.c | 3 +
fs/resctrl/ctrlmondata.c | 5 +-
fs/resctrl/internal.h | 2 +
fs/resctrl/rdtgroup.c | 301 +++++++++++++++++-
include/linux/resctrl.h | 16 +
include/linux/sched.h | 1 +
15 files changed, 623 insertions(+), 21 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 114+ messages in thread
* [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Privilege Level Zero Association Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-02-09 18:44 ` Reinette Chatre
2026-01-21 21:12 ` [RFC PATCH 02/19] x86,fs/resctrl: Add the resource for Global Memory Bandwidth Allocation Babu Moger
` (18 subsequent siblings)
19 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
On AMD systems, the existing MBA feature allows the user to set a bandwidth
limit for each QoS domain. However, multiple QoS domains share system
memory bandwidth as a resource. To ensure that system memory bandwidth is
not over-utilized, the user must statically partition the available system
bandwidth between the active QoS domains. This typically results in system
memory bandwidth being under-utilized, since not all QoS domains use their
full bandwidth allocation.
AMD PQoS Global Bandwidth Enforcement (GLBE) provides a mechanism
for software to specify bandwidth limits for groups of threads that span
multiple QoS Domains. This collection of QoS domains is referred to as the
GLBE control domain. The GLBE ceiling sets a maximum limit on memory
bandwidth in the GLBE control domain. Bandwidth is shared by all threads in
a Class of Service (COS) across every QoS domain managed by the GLBE
control domain.
GLBE support is reported through CPUID.8000_0020_EBX_x0[GLBE] (bit 7).
When this bit is set to 1, the platform supports GLBE.
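As a rough illustration of the enumeration described above (not the kernel's code; the series wires this up through the scattered CPUID table), detecting GLBE amounts to testing bit 7 of EBX from CPUID leaf 0x80000020. The sketch below decodes a made-up sample EBX value rather than reading the hardware:

```shell
# Illustrative decode of the GLBE bit in CPUID 0x80000020 EBX.
# sample_ebx is a made-up value, not a live cpuid read.
GLBE_BIT=7
decode_glbe() { echo $(( ($1 >> GLBE_BIT) & 1 )); }

sample_ebx=$(( 1 << GLBE_BIT ))   # pretend the platform enumerates GLBE
echo "GLBE supported: $(decode_glbe "$sample_ebx")"
echo "GLBE supported: $(decode_glbe 0)"
```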
Since the AMD Memory Bandwidth Enforcement feature is represented as MBA,
the Global Bandwidth Enforcement feature will be shown as GMBA to maintain
consistent naming.
Add GMBA support to resctrl and introduce a kernel parameter that allows
enabling or disabling the feature at boot time.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
arch/x86/kernel/cpu/scattered.c | 1 +
4 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index abd77f39c783..e3058b3d47e9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6325,7 +6325,7 @@ Kernel parameters
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec, abmc, sdciae, energy[:guid],
+ mba, gmba, smba, bmec, abmc, sdciae, energy[:guid],
perf[:guid].
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index c3b53beb1300..86d1339cd1bd 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -505,7 +505,6 @@
#define X86_FEATURE_ABMC (21*32+15) /* Assignable Bandwidth Monitoring Counters */
#define X86_FEATURE_MSR_IMM (21*32+16) /* MSR immediate form instructions */
#define X86_FEATURE_SGX_EUPDATESVN (21*32+17) /* Support for ENCLS[EUPDATESVN] instruction */
-
#define X86_FEATURE_SDCIAE (21*32+18) /* L3 Smart Data Cache Injection Allocation Enforcement */
#define X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO (21*32+19) /*
* Clear CPU buffers before VM-Enter if the vCPU
@@ -513,6 +512,7 @@
* and purposes if CLEAR_CPU_BUF_VM is set).
*/
#define X86_FEATURE_X2AVIC_EXT (21*32+20) /* AMD SVM x2AVIC support for 4k vCPUs */
+#define X86_FEATURE_GMBA (21*32+21) /* Global Memory Bandwidth Allocation */
/*
* BUG word(s)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9fcc06e9e72e..8b3457518ff4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -795,6 +795,7 @@ enum {
RDT_FLAG_L2_CAT,
RDT_FLAG_L2_CDP,
RDT_FLAG_MBA,
+ RDT_FLAG_GMBA,
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
RDT_FLAG_ABMC,
@@ -822,6 +823,7 @@ static struct rdt_options rdt_options[] __ro_after_init = {
RDT_OPT(RDT_FLAG_L2_CAT, "l2cat", X86_FEATURE_CAT_L2),
RDT_OPT(RDT_FLAG_L2_CDP, "l2cdp", X86_FEATURE_CDP_L2),
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
+ RDT_OPT(RDT_FLAG_GMBA, "gmba", X86_FEATURE_GMBA),
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 42c7eac0c387..d081d167bac9 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -59,6 +59,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_BMEC, CPUID_EBX, 3, 0x80000020, 0 },
{ X86_FEATURE_ABMC, CPUID_EBX, 5, 0x80000020, 0 },
{ X86_FEATURE_SDCIAE, CPUID_EBX, 6, 0x80000020, 0 },
+ { X86_FEATURE_GMBA, CPUID_EBX, 7, 0x80000020, 0 },
{ X86_FEATURE_TSA_SQ_NO, CPUID_ECX, 1, 0x80000021, 0 },
{ X86_FEATURE_TSA_L1_NO, CPUID_ECX, 2, 0x80000021, 0 },
{ X86_FEATURE_AMD_WORKLOAD_CLASS, CPUID_EAX, 22, 0x80000021, 0 },
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 02/19] x86,fs/resctrl: Add the resource for Global Memory Bandwidth Allocation
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Privilege Level Zero Association Babu Moger
2026-01-21 21:12 ` [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE) Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 03/19] fs/resctrl: Add new interface max_bandwidth Babu Moger
` (17 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
AMD PQoS Global Bandwidth Enforcement (GLBE) provides a mechanism for
software to specify bandwidth limits for groups of threads that span
multiple QoS Domains.
Add the resource definition for GLBE in the resctrl filesystem. The
resource allows users to configure and manage the global memory bandwidth
allocation settings for a GLBE domain. A GLBE domain is a set of
participating QoS domains that are grouped together for global bandwidth
allocation.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 46 ++++++++++++++++++++++++++++--
fs/resctrl/ctrlmondata.c | 3 +-
fs/resctrl/rdtgroup.c | 13 +++++++--
include/linux/resctrl.h | 1 +
5 files changed, 58 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 43adc38d31d5..e9b21676102c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1274,6 +1274,7 @@
#define MSR_IA32_L3_QOS_ABMC_CFG 0xc00003fd
#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
+#define MSR_IA32_GMBA_BW_BASE 0xc0000600
/* AMD-V MSRs */
#define MSR_VM_CR 0xc0010114
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8b3457518ff4..8801dcfb40fb 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -91,6 +91,15 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
.schema_fmt = RESCTRL_SCHEMA_RANGE,
},
},
+ [RDT_RESOURCE_GMBA] =
+ {
+ .r_resctrl = {
+ .name = "GMB",
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_GMBA),
+ .schema_fmt = RESCTRL_SCHEMA_RANGE,
+ },
+ },
[RDT_RESOURCE_SMBA] =
{
.r_resctrl = {
@@ -239,10 +248,22 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
u32 eax, ebx, ecx, edx, subleaf;
/*
- * Query CPUID_Fn80000020_EDX_x01 for MBA and
- * CPUID_Fn80000020_EDX_x02 for SMBA
+ * Query CPUID function 0x80000020 to obtain num_closid and max_bw values.
+ * Use subleaf 1 for MBA, subleaf 2 for SMBA, and subleaf 7 for GMBA.
*/
- subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;
+ switch (r->rid) {
+ case RDT_RESOURCE_MBA:
+ subleaf = 1;
+ break;
+ case RDT_RESOURCE_SMBA:
+ subleaf = 2;
+ break;
+ case RDT_RESOURCE_GMBA:
+ subleaf = 7;
+ break;
+ default:
+ return false;
+ }
cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx);
hw_res->num_closid = edx + 1;
@@ -909,6 +930,19 @@ static __init bool get_mem_config(void)
return false;
}
+static __init bool get_gmem_config(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_GMBA];
+
+ if (!rdt_cpu_has(X86_FEATURE_GMBA))
+ return false;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return __rdt_get_mem_config_amd(&hw_res->r_resctrl);
+
+ return false;
+}
+
static __init bool get_slow_mem_config(void)
{
struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_SMBA];
@@ -954,6 +988,9 @@ static __init bool get_rdt_alloc_resources(void)
if (get_mem_config())
ret = true;
+ if (get_gmem_config())
+ ret = true;
+
if (get_slow_mem_config())
ret = true;
@@ -1054,6 +1091,9 @@ static __init void rdt_init_res_defs_amd(void)
} else if (r->rid == RDT_RESOURCE_MBA) {
hw_res->msr_base = MSR_IA32_MBA_BW_BASE;
hw_res->msr_update = mba_wrmsr_amd;
+ } else if (r->rid == RDT_RESOURCE_GMBA) {
+ hw_res->msr_base = MSR_IA32_GMBA_BW_BASE;
+ hw_res->msr_update = mba_wrmsr_amd;
} else if (r->rid == RDT_RESOURCE_SMBA) {
hw_res->msr_base = MSR_IA32_SMBA_BW_BASE;
hw_res->msr_update = mba_wrmsr_amd;
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index cc4237c57cbe..ad7327b90d3f 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -246,7 +246,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
return -EINVAL;
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
- (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) {
+ (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_GMBA ||
+ r->rid == RDT_RESOURCE_SMBA)) {
rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
return -EINVAL;
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index ba8d503551cd..ae6c515f4c19 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1412,7 +1412,8 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
- if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
+ if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_GMBA ||
+ r->rid == RDT_RESOURCE_SMBA)
continue;
has_cache = true;
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
@@ -1615,6 +1616,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
closid,
type);
if (r->rid == RDT_RESOURCE_MBA ||
+ r->rid == RDT_RESOURCE_GMBA ||
r->rid == RDT_RESOURCE_SMBA)
size = ctrl;
else
@@ -2168,13 +2170,18 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name)
static void thread_throttle_mode_init(void)
{
enum membw_throttle_mode throttle_mode = THREAD_THROTTLE_UNDEFINED;
- struct rdt_resource *r_mba, *r_smba;
+ struct rdt_resource *r_mba, *r_gmba, *r_smba;
r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
if (r_mba->alloc_capable &&
r_mba->membw.throttle_mode != THREAD_THROTTLE_UNDEFINED)
throttle_mode = r_mba->membw.throttle_mode;
+ r_gmba = resctrl_arch_get_resource(RDT_RESOURCE_GMBA);
+ if (r_gmba->alloc_capable &&
+ r_gmba->membw.throttle_mode != THREAD_THROTTLE_UNDEFINED)
+ throttle_mode = r_gmba->membw.throttle_mode;
+
r_smba = resctrl_arch_get_resource(RDT_RESOURCE_SMBA);
if (r_smba->alloc_capable &&
r_smba->membw.throttle_mode != THREAD_THROTTLE_UNDEFINED)
@@ -2394,6 +2401,7 @@ static unsigned long fflags_from_resource(struct rdt_resource *r)
case RDT_RESOURCE_L2:
return RFTYPE_RES_CACHE;
case RDT_RESOURCE_MBA:
+ case RDT_RESOURCE_GMBA:
case RDT_RESOURCE_SMBA:
return RFTYPE_RES_MB;
case RDT_RESOURCE_PERF_PKG:
@@ -3643,6 +3651,7 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
if (r->rid == RDT_RESOURCE_MBA ||
+ r->rid == RDT_RESOURCE_GMBA ||
r->rid == RDT_RESOURCE_SMBA) {
rdtgroup_init_mba(r, rdtgrp->closid);
if (is_mba_sc(r))
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 006e57fd7ca5..17e12cd3befc 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -52,6 +52,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
+ RDT_RESOURCE_GMBA,
RDT_RESOURCE_SMBA,
RDT_RESOURCE_PERF_PKG,
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 03/19] fs/resctrl: Add new interface max_bandwidth
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Privilege Level Zero Association Babu Moger
2026-01-21 21:12 ` [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE) Babu Moger
2026-01-21 21:12 ` [RFC PATCH 02/19] x86,fs/resctrl: Add the resource for Global Memory Bandwidth Allocation Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-02-06 23:58 ` Reinette Chatre
2026-01-21 21:12 ` [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation Babu Moger
` (16 subsequent siblings)
19 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
While min_bandwidth is exposed for each resource under
/sys/fs/resctrl, the maximum supported bandwidth is not currently shown.
Add max_bandwidth to report the maximum bandwidth permitted for a resource.
This helps users understand the limits of the associated resource control
group.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/filesystems/resctrl.rst | 6 +++++-
fs/resctrl/rdtgroup.c | 17 +++++++++++++++++
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index 45dde8774128..94187dd3c244 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -224,7 +224,11 @@ Memory bandwidth(MB) subdirectory contains the following files
with respect to allocation:
"min_bandwidth":
- The minimum memory bandwidth percentage which
+ The minimum memory bandwidth percentage or units which
+ user can request.
+
+"max_bandwidth":
+ The maximum memory bandwidth percentage or units which
user can request.
"bandwidth_gran":
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index ae6c515f4c19..d2eab9007cc1 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1153,6 +1153,16 @@ static int rdt_min_bw_show(struct kernfs_open_file *of,
return 0;
}
+static int rdt_max_bw_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
+ struct rdt_resource *r = s->res;
+
+ seq_printf(seq, "%u\n", r->membw.max_bw);
+ return 0;
+}
+
static int rdt_num_rmids_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
@@ -1959,6 +1969,13 @@ static struct rftype res_common_files[] = {
.seq_show = rdt_min_bw_show,
.fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB,
},
+ {
+ .name = "max_bandwidth",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_max_bw_show,
+ .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB,
+ },
{
.name = "bandwidth_gran",
.mode = 0444,
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Privilege Level Zero Association Babu Moger
` (2 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 03/19] fs/resctrl: Add new interface max_bandwidth Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-02-03 0:00 ` Luck, Tony
2026-01-21 21:12 ` [RFC PATCH 05/19] x86,fs/resctrl: Add support for Global Slow Memory Bandwidth Allocation (GSMBA) Babu Moger
` (15 subsequent siblings)
19 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Add the documentation and example to setup Global Memory Bandwidth
Allocation (GMBA) in resctrl filesystem.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/filesystems/resctrl.rst | 43 +++++++++++++++++++++++++--
1 file changed, 41 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index 94187dd3c244..6ff6162719e8 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -28,6 +28,7 @@ SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
ABMC (Assignable Bandwidth Monitoring Counters) ""
SDCIAE (Smart Data Cache Injection Allocation Enforcement) ""
+GMBA (Global Memory Bandwidth Allocation) ""
=============================================================== ================================
Historically, new features were made visible by default in /proc/cpuinfo. This
@@ -960,6 +961,21 @@ Memory bandwidth domain is L3 cache.
MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
+Global Memory bandwidth Allocation
+-----------------------------------
+
+AMD hardware supports Global Memory Bandwidth Allocation (GMBA), which
+provides a mechanism for software to specify bandwidth limits for groups of
+threads that span multiple QoS domains. This collection of QoS domains is
+referred to as the GMBA control domain. The GMBA control domain is created
+by setting the same GMBA limits in one or more QoS domains. Setting the
+default max_bandwidth excludes a QoS domain from the GMBA control domain.
+
+Global Memory bandwidth domain is L3 cache.
+::
+
+ GMB:<cache_id0>=bandwidth;<cache_id1>=bandwidth;...
+
Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
@@ -997,8 +1013,8 @@ which you wish to change. E.g.
Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
-domains. The allocated resources are in multiples of one eighth GB/s.
-When writing to the file, you need to specify what cache id you wish to
+domains. The allocated resources are in multiples of 1/8 GB/s. When
+writing to the file, you need to specify what cache id you wish to
configure the bandwidth limit.
For example, to allocate 2GB/s limit on the first cache id:
@@ -1014,6 +1030,29 @@ For example, to allocate 2GB/s limit on the first cache id:
MB:0=2048;1= 16;2=2048;3=2048
L3:0=ffff;1=ffff;2=ffff;3=ffff
+Reading/writing the schemata file (on AMD systems) with GMBA feature
+--------------------------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of 1 GB/s. The GMBA
+control domain is created by setting the same GMBA limits in one or more
+QoS domains.
+
+For example, to configure a GMBA domain consisting of domains 0 and 2
+with an 8 GB/s limit:
+
+::
+
+ # cat schemata
+ GMB:0=2048;1=2048;2=2048;3=2048
+ MB:0=4096;1=4096;2=4096;3=4096
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "GMB:0=8;2=8" > schemata
+ # cat schemata
+ GMB:0= 8;1=2048;2= 8;3=2048
+ MB:0=4096;1=4096;2=4096;3=4096
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
Reading/writing the schemata file (on AMD systems) with SMBA feature
--------------------------------------------------------------------
Reading and writing the schemata file is the same as without SMBA in
--
2.34.1
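One practical detail worth spelling out from the documentation patch above: MB schemata values are written in multiples of 1/8 GB/s, while GMB values are in multiples of 1 GB/s, so the same target limit is encoded differently per resource. A small illustrative helper (not part of the series; function names are made up):

```shell
# Illustrative schemata unit conversion, per the documentation above:
# MB values are multiples of 1/8 GB/s, GMB values are multiples of 1 GB/s.
mb_value()  { echo $(( $1 * 8 )); }   # e.g. 2 GB/s -> 16
gmb_value() { echo $(( $1 )); }       # e.g. 8 GB/s -> 8

# Encode a 2 GB/s MB limit and an 8 GB/s GMB limit for cache id 0:
echo "MB:0=$(mb_value 2) GMB:0=$(gmb_value 8)"
```

This matches the examples in the patch: writing 16 to MB yields a 2 GB/s limit, and writing 8 to GMB yields an 8 GB/s limit.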
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 05/19] x86,fs/resctrl: Add support for Global Slow Memory Bandwidth Allocation (GSMBA)
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Privilege Level Zero Association Babu Moger
` (3 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 06/19] x86,fs/resctrl: Add the resource for Global Slow Memory Bandwidth Enforcement (GLSBE) Babu Moger
` (14 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
AMD PQoS Global Slow Memory Bandwidth Enforcement (GLSBE) is an extension
to GLBE that provides a mechanism for software to specify slow memory
bandwidth limits for groups of threads that span multiple QoS Domains.
GLSBE operates within the same GLBE Control Domains defined by GLBE.
Support for GLSBE is indicated by CPUID.8000_0020_EBX_x0[GLSBE] (bit 8).
When this bit is set to 1, the platform supports GLSBE.
Since the AMD Slow Memory Bandwidth Enforcement feature is represented
as SMBA, the Global Slow Memory Bandwidth Enforcement feature will be
shown as GSMBA to maintain consistent naming.
Add GSMBA support to resctrl and introduce a kernel parameter that allows
enabling or disabling the feature at boot time.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
arch/x86/kernel/cpu/scattered.c | 1 +
4 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e3058b3d47e9..d3eb21e76aef 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6325,7 +6325,7 @@ Kernel parameters
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, gmba, smba, bmec, abmc, sdciae, energy[:guid],
+ mba, gmba, smba, gsmba, bmec, abmc, sdciae, energy[:guid],
perf[:guid].
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 86d1339cd1bd..57d59399c508 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -513,6 +513,7 @@
*/
#define X86_FEATURE_X2AVIC_EXT (21*32+20) /* AMD SVM x2AVIC support for 4k vCPUs */
#define X86_FEATURE_GMBA (21*32+21) /* Global Memory Bandwidth Allocation */
+#define X86_FEATURE_GSMBA (21*32+22) /* Global Slow Memory Bandwidth Enforcement */
/*
* BUG word(s)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8801dcfb40fb..b4468481d3bf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -818,6 +818,7 @@ enum {
RDT_FLAG_MBA,
RDT_FLAG_GMBA,
RDT_FLAG_SMBA,
+ RDT_FLAG_GSMBA,
RDT_FLAG_BMEC,
RDT_FLAG_ABMC,
RDT_FLAG_SDCIAE,
@@ -846,6 +847,7 @@ static struct rdt_options rdt_options[] __ro_after_init = {
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
RDT_OPT(RDT_FLAG_GMBA, "gmba", X86_FEATURE_GMBA),
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
+ RDT_OPT(RDT_FLAG_GSMBA, "gsmba", X86_FEATURE_GSMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
RDT_OPT(RDT_FLAG_SDCIAE, "sdciae", X86_FEATURE_SDCIAE),
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index d081d167bac9..62894789e345 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -60,6 +60,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_ABMC, CPUID_EBX, 5, 0x80000020, 0 },
{ X86_FEATURE_SDCIAE, CPUID_EBX, 6, 0x80000020, 0 },
{ X86_FEATURE_GMBA, CPUID_EBX, 7, 0x80000020, 0 },
+ { X86_FEATURE_GSMBA, CPUID_EBX, 8, 0x80000020, 0 },
{ X86_FEATURE_TSA_SQ_NO, CPUID_ECX, 1, 0x80000021, 0 },
{ X86_FEATURE_TSA_L1_NO, CPUID_ECX, 2, 0x80000021, 0 },
{ X86_FEATURE_AMD_WORKLOAD_CLASS, CPUID_EAX, 22, 0x80000021, 0 },
--
2.34.1
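The scattered.c entry above maps X86_FEATURE_GSMBA to EBX bit 8 of CPUID leaf 0x80000020. The bit test itself can be sketched in plain C; `cpuid_reg_has_bit()` is a hypothetical helper for illustration, not kernel code, and the only facts taken from the patch are the register and bit position:

```c
#include <stdint.h>

/* Sketch of the scattered-CPUID bit test: given the raw EBX value
 * returned for leaf 0x80000020, the GSMBA feature bit is bit 8.
 * Hypothetical helper, not from the patch. */
static inline int cpuid_reg_has_bit(uint32_t reg_val, unsigned int bit)
{
	return (int)((reg_val >> bit) & 1u);
}
```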
* [RFC PATCH 06/19] x86,fs/resctrl: Add the resource for Global Slow Memory Bandwidth Enforcement (GLSBE)
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (4 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 05/19] x86,fs/resctrl: Add support for Global Slow Memory Bandwidth Allocation (GSMBA) Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 07/19] fs/resctrl: Add the documentation for Global Slow Memory Bandwidth Allocation Babu Moger
` (13 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
AMD PQoS Global Slow Bandwidth Enforcement (GLSBE) provides a mechanism
for software to specify bandwidth limits for groups of threads that span
multiple QoS Domains.
Add the resource definition for Global Slow Memory Bandwidth Enforcement
to the resctrl filesystem. The resource allows users to configure and
manage the global slow memory bandwidth allocation settings for the GLBE
control domain.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 31 ++++++++++++++++++++++++++++++
fs/resctrl/ctrlmondata.c | 4 ++--
fs/resctrl/rdtgroup.c | 16 +++++++++++----
include/linux/resctrl.h | 1 +
5 files changed, 47 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e9b21676102c..0ef1f6a8f4bc 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1275,6 +1275,7 @@
#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
#define MSR_IA32_GMBA_BW_BASE 0xc0000600
+#define MSR_IA32_GSMBA_BW_BASE 0xc0000680
/* AMD-V MSRs */
#define MSR_VM_CR 0xc0010114
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b4468481d3bf..cd208cd71232 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -109,6 +109,15 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
.schema_fmt = RESCTRL_SCHEMA_RANGE,
},
},
+ [RDT_RESOURCE_GSMBA] =
+ {
+ .r_resctrl = {
+ .name = "GSMBA",
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_GSMBA),
+ .schema_fmt = RESCTRL_SCHEMA_RANGE,
+ },
+ },
[RDT_RESOURCE_PERF_PKG] =
{
.r_resctrl = {
@@ -261,6 +270,9 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
case RDT_RESOURCE_GMBA:
subleaf = 7;
break;
+ case RDT_RESOURCE_GSMBA:
+ subleaf = 8;
+ break;
default:
return false;
}
@@ -958,6 +970,19 @@ static __init bool get_slow_mem_config(void)
return false;
}
+static __init bool get_gslow_mem_config(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_GSMBA];
+
+ if (!rdt_cpu_has(X86_FEATURE_GSMBA))
+ return false;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return __rdt_get_mem_config_amd(&hw_res->r_resctrl);
+
+ return false;
+}
+
static __init bool get_rdt_alloc_resources(void)
{
struct rdt_resource *r;
@@ -996,6 +1021,9 @@ static __init bool get_rdt_alloc_resources(void)
if (get_slow_mem_config())
ret = true;
+ if (get_gslow_mem_config())
+ ret = true;
+
return ret;
}
@@ -1099,6 +1127,9 @@ static __init void rdt_init_res_defs_amd(void)
} else if (r->rid == RDT_RESOURCE_SMBA) {
hw_res->msr_base = MSR_IA32_SMBA_BW_BASE;
hw_res->msr_update = mba_wrmsr_amd;
+ } else if (r->rid == RDT_RESOURCE_GSMBA) {
+ hw_res->msr_base = MSR_IA32_GSMBA_BW_BASE;
+ hw_res->msr_update = mba_wrmsr_amd;
}
}
}
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index ad7327b90d3f..5c529de24612 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -247,8 +247,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
(r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_GMBA ||
- r->rid == RDT_RESOURCE_SMBA)) {
- rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
+ r->rid == RDT_RESOURCE_SMBA || r->rid == RDT_RESOURCE_GSMBA)) {
+ rdt_last_cmd_puts("Cannot pseudo-lock MBA/SMBA resource\n");
return -EINVAL;
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index d2eab9007cc1..fc034f4481e3 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1423,7 +1423,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
list_for_each_entry(s, &resctrl_schema_all, list) {
r = s->res;
if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_GMBA ||
- r->rid == RDT_RESOURCE_SMBA)
+ r->rid == RDT_RESOURCE_SMBA || r->rid == RDT_RESOURCE_GSMBA)
continue;
has_cache = true;
list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
@@ -1627,7 +1627,8 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
type);
if (r->rid == RDT_RESOURCE_MBA ||
r->rid == RDT_RESOURCE_GMBA ||
- r->rid == RDT_RESOURCE_SMBA)
+ r->rid == RDT_RESOURCE_SMBA ||
+ r->rid == RDT_RESOURCE_GSMBA)
size = ctrl;
else
size = rdtgroup_cbm_to_size(r, d, ctrl);
@@ -2187,7 +2188,7 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name)
static void thread_throttle_mode_init(void)
{
enum membw_throttle_mode throttle_mode = THREAD_THROTTLE_UNDEFINED;
- struct rdt_resource *r_mba, *r_gmba, *r_smba;
+ struct rdt_resource *r_mba, *r_gmba, *r_smba, *r_gsmba;
r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
if (r_mba->alloc_capable &&
@@ -2204,6 +2205,11 @@ static void thread_throttle_mode_init(void)
r_smba->membw.throttle_mode != THREAD_THROTTLE_UNDEFINED)
throttle_mode = r_smba->membw.throttle_mode;
+ r_gsmba = resctrl_arch_get_resource(RDT_RESOURCE_GSMBA);
+ if (r_gsmba->alloc_capable &&
+ r_gsmba->membw.throttle_mode != THREAD_THROTTLE_UNDEFINED)
+ throttle_mode = r_gsmba->membw.throttle_mode;
+
if (throttle_mode == THREAD_THROTTLE_UNDEFINED)
return;
@@ -2420,6 +2426,7 @@ static unsigned long fflags_from_resource(struct rdt_resource *r)
case RDT_RESOURCE_MBA:
case RDT_RESOURCE_GMBA:
case RDT_RESOURCE_SMBA:
+ case RDT_RESOURCE_GSMBA:
return RFTYPE_RES_MB;
case RDT_RESOURCE_PERF_PKG:
return RFTYPE_RES_PERF_PKG;
@@ -3669,7 +3676,8 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp)
r = s->res;
if (r->rid == RDT_RESOURCE_MBA ||
r->rid == RDT_RESOURCE_GMBA ||
- r->rid == RDT_RESOURCE_SMBA) {
+ r->rid == RDT_RESOURCE_SMBA ||
+ r->rid == RDT_RESOURCE_GSMBA) {
rdtgroup_init_mba(r, rdtgrp->closid);
if (is_mba_sc(r))
continue;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 17e12cd3befc..63d74c0dbb8f 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -54,6 +54,7 @@ enum resctrl_res_level {
RDT_RESOURCE_MBA,
RDT_RESOURCE_GMBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_GSMBA,
RDT_RESOURCE_PERF_PKG,
/* Must be the last */
--
2.34.1
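The patch above sets hw_res->msr_base to MSR_IA32_GSMBA_BW_BASE (0xc0000680) and reuses mba_wrmsr_amd. Assuming the usual AMD layout of one bandwidth MSR per CLOSID at base + CLOSID (a detail mba_wrmsr_amd encapsulates; `gsmba_msr_for_closid()` below is a hypothetical helper, not from the patch), the addressing can be sketched as:

```c
#include <stdint.h>

#define MSR_IA32_GSMBA_BW_BASE 0xc0000680u

/* Sketch: per-CLOS bandwidth MSR address, assuming base + CLOSID
 * layout as used by the other AMD MBA-style resources. */
static inline uint32_t gsmba_msr_for_closid(uint32_t closid)
{
	return MSR_IA32_GSMBA_BW_BASE + closid;
}
```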
* [RFC PATCH 07/19] fs/resctrl: Add the documentation for Global Slow Memory Bandwidth Allocation
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (5 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 06/19] x86,fs/resctrl: Add the resource for Global Slow Memory Bandwidth Enforcement (GLSBE) Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 08/19] x86/resctrl: Support Privilege-Level Zero Association (PLZA) Babu Moger
` (12 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Add the documentation and example to setup Global Slow Memory Bandwidth
Allocation (GSMBA) in resctrl filesystem.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/filesystems/resctrl.rst | 39 +++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index 6ff6162719e8..3d66814a1d7f 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -29,6 +29,7 @@ BMEC (Bandwidth Monitoring Event Configuration) ""
ABMC (Assignable Bandwidth Monitoring Counters) ""
SDCIAE (Smart Data Cache Injection Allocation Enforcement) ""
GMBA (Global Memory Bandwidth Allocation) ""
+GSMBA (Global Slow Memory Bandwidth Allocation) ""
=============================================================== ================================
Historically, new features were made visible by default in /proc/cpuinfo. This
@@ -995,6 +996,19 @@ is formatted as:
SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+Global Slow Memory Bandwidth Allocation (GSMBA)
+-----------------------------------------------
+
+AMD hardware supports Global Slow Memory Bandwidth Allocation (GSMBA),
+which provides a mechanism for software to specify bandwidth limits for
+groups of threads that span multiple QoS Domains. It operates similarly
+to GMBA; however, the target resource is slow memory.
+
+The Global Slow Memory bandwidth domain is the L3 cache.
+::
+
+ GSMBA:<cache_id0>=bandwidth;<cache_id1>=bandwidth;...
+
Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
@@ -1073,6 +1087,31 @@ For example, to allocate 8GB/s limit on the first cache id:
MB:0=2048;1=2048;2=2048;3=2048
L3:0=ffff;1=ffff;2=ffff;3=ffff
+Reading/writing the schemata file (on AMD systems) with GSMBA feature
+---------------------------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of 1 GB/s. The GSMBA
+control domain is created by setting the same GSMBA limits in one or
+more QoS domains.
+
+For example, to configure a GSMBA domain consisting of domains 0 and 2
+with an 8 GB/s limit:
+
+::
+
+ # cat schemata
+ GSMBA:0=2048;1=2048;2=2048;3=2048
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=4096;1=4096;2=4096;3=4096
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "GSMBA:0=8;2=8" > schemata
+ # cat schemata
+ GSMBA:0= 8;1=2048;2= 8;3=2048
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=4096;1=4096;2=4096;3=4096
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
--
2.34.1
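The schemata lines shown above follow a simple `RESOURCE:<id>=<value>;...` grammar. A minimal user-space sketch of parsing one GSMBA line follows; this is not the kernel's parse_line() in fs/resctrl/ctrlmondata.c, and `parse_gsmba_line()` is a hypothetical name used only to illustrate the format:

```c
#include <stdlib.h>
#include <string.h>

/* Parse "GSMBA:<id>=<bw>;<id>=<bw>..." into parallel arrays of
 * domain ids and bandwidth values. Returns the number of pairs
 * parsed, or -1 on a malformed line. Illustrative sketch only. */
static int parse_gsmba_line(const char *line, unsigned int *ids,
			    unsigned int *bw, int max)
{
	const char *p = strchr(line, ':');
	int n = 0;

	if (!p)
		return -1;
	p++;
	while (*p && n < max) {
		char *end;

		ids[n] = (unsigned int)strtoul(p, &end, 10);
		if (*end != '=')
			return -1;
		bw[n] = (unsigned int)strtoul(end + 1, &end, 10);
		n++;
		if (*end == ';')
			p = end + 1;
		else
			break;
	}
	return n;
}
```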
* [RFC PATCH 08/19] x86/resctrl: Support Privilege-Level Zero Association (PLZA)
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (6 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 07/19] fs/resctrl: Add the documentation for Global Slow Memory Bandwidth Allocation Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure Babu Moger
` (11 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Customers have identified an issue when using the QoS Resource Control
feature. If the memory bandwidth associated with a CLOSID is aggressively
throttled and a thread in that CLOSID moves into kernel mode, kernel
operations are also aggressively throttled. This can stall forward
progress and eventually degrade overall system performance. AMD hardware
supports Privilege-Level Zero Association (PLZA), a feature that changes
a thread's association as soon as it begins executing in Privilege-Level
Zero.
Privilege-Level Zero Association (PLZA) allows the user to specify a
CLOSID and/or RMID to be associated with execution in Privilege-Level
Zero. When PLZA is enabled on a HW thread and the thread enters
Privilege-Level Zero, transactions from that thread are associated with
the PLZA CLOSID and/or RMID. Otherwise, the HW thread is associated with
the CLOSID and RMID identified by PQR_ASSOC.
Add PLZA support to resctrl and introduce a kernel parameter that allows
enabling or disabling the feature at boot time.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
arch/x86/kernel/cpu/scattered.c | 1 +
4 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d3eb21e76aef..4ce3a291cd68 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6325,7 +6325,7 @@ Kernel parameters
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, gmba, smba, gsmba, bmec, abmc, sdciae, energy[:guid],
+ mba, gmba, smba, gsmba, bmec, abmc, sdciae, plza, energy[:guid],
perf[:guid].
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 57d59399c508..0c3b44836cfe 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -514,6 +514,7 @@
#define X86_FEATURE_X2AVIC_EXT (21*32+20) /* AMD SVM x2AVIC support for 4k vCPUs */
#define X86_FEATURE_GMBA (21*32+21) /* Global Memory Bandwidth Allocation */
#define X86_FEATURE_GSMBA (21*32+22) /* Global Slow Memory Bandwidth Enforcement */
+#define X86_FEATURE_PLZA (21*32+23) /* Privilege-Level Zero Association */
/*
* BUG word(s)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index cd208cd71232..2de3140dd6d1 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -834,6 +834,7 @@ enum {
RDT_FLAG_BMEC,
RDT_FLAG_ABMC,
RDT_FLAG_SDCIAE,
+ RDT_FLAG_PLZA,
};
#define RDT_OPT(idx, n, f) \
@@ -863,6 +864,7 @@ static struct rdt_options rdt_options[] __ro_after_init = {
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
RDT_OPT(RDT_FLAG_SDCIAE, "sdciae", X86_FEATURE_SDCIAE),
+ RDT_OPT(RDT_FLAG_PLZA, "plza", X86_FEATURE_PLZA),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 62894789e345..4c98c8c5359f 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -61,6 +61,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_SDCIAE, CPUID_EBX, 6, 0x80000020, 0 },
{ X86_FEATURE_GMBA, CPUID_EBX, 7, 0x80000020, 0 },
{ X86_FEATURE_GSMBA, CPUID_EBX, 8, 0x80000020, 0 },
+ { X86_FEATURE_PLZA, CPUID_EBX, 9, 0x80000020, 0 },
{ X86_FEATURE_TSA_SQ_NO, CPUID_ECX, 1, 0x80000021, 0 },
{ X86_FEATURE_TSA_L1_NO, CPUID_ECX, 2, 0x80000021, 0 },
{ X86_FEATURE_AMD_WORKLOAD_CLASS, CPUID_EAX, 22, 0x80000021, 0 },
--
2.34.1
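The rdt= boot parameter documented above takes a comma-separated feature list where a '!' prefix turns a feature off (e.g. rdt=cmt,!mba). A user-space sketch of that convention follows; `rdt_opt_state()` is a hypothetical helper and not the kernel's option parser in arch/x86/kernel/cpu/resctrl/core.c:

```c
#include <string.h>

/* Sketch of the rdt= convention: returns 1 if @name is force-enabled,
 * -1 if force-disabled ("!name"), 0 if not mentioned at all.
 * Illustrative only; the kernel's parser also validates names. */
static int rdt_opt_state(const char *cmdline, const char *name)
{
	char buf[128];
	char *tok;
	int off;

	strncpy(buf, cmdline, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';
	for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		off = (tok[0] == '!');
		if (!strcmp(tok + off, name))
			return off ? -1 : 1;
	}
	return 0;
}
```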
* [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (7 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 08/19] x86/resctrl: Support Privilege-Level Zero Association (PLZA) Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-02-11 15:19 ` Ben Horgan
2026-01-21 21:12 ` [RFC PATCH 10/19] fs/resctrl: Expose plza_capable via control info file Babu Moger
` (10 subsequent siblings)
19 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Add plza_capable field to the rdt_resource structure to indicate whether
Privilege Level Zero Association (PLZA) is supported for that resource
type.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++++
include/linux/resctrl.h | 3 +++
3 files changed, 14 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 2de3140dd6d1..e41fe5fa3f30 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -295,6 +295,9 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
r->alloc_capable = true;
+ if (rdt_cpu_has(X86_FEATURE_PLZA))
+ r->plza_capable = true;
+
return true;
}
@@ -314,6 +317,9 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
r->alloc_capable = true;
+
+ if (rdt_cpu_has(X86_FEATURE_PLZA))
+ r->plza_capable = true;
}
static void rdt_get_cdp_config(int level)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 885026468440..540e1e719d7f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -229,6 +229,11 @@ bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
return rdt_resources_all[l].cdp_enabled;
}
+bool resctrl_arch_get_plza_capable(enum resctrl_res_level l)
+{
+ return rdt_resources_all[l].r_resctrl.plza_capable;
+}
+
void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 63d74c0dbb8f..ae252a0e6d92 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -319,6 +319,7 @@ struct resctrl_mon {
* @name: Name to use in "schemata" file.
* @schema_fmt: Which format string and parser is used for this schema.
* @cdp_capable: Is the CDP feature available on this resource
+ * @plza_capable: Is Privilege Level Zero Association capable?
*/
struct rdt_resource {
int rid;
@@ -334,6 +335,7 @@ struct rdt_resource {
char *name;
enum resctrl_schema_fmt schema_fmt;
bool cdp_capable;
+ bool plza_capable;
};
/*
@@ -481,6 +483,7 @@ static inline u32 resctrl_get_config_index(u32 closid,
bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
+bool resctrl_arch_get_plza_capable(enum resctrl_res_level l);
/**
* resctrl_arch_mbm_cntr_assign_enabled() - Check if MBM counter assignment
--
2.34.1
* [RFC PATCH 10/19] fs/resctrl: Expose plza_capable via control info file
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (8 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 11/19] resctrl: Introduce PLZA static key enable/disable helpers Babu Moger
` (9 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Add a new resctrl info file, "plza_capable", to report whether a resource
supports the PLZA capability. This allows users to query PLZA support
directly through resctrl without having to infer it from other resource
attributes.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/filesystems/resctrl.rst | 17 +++++++++++++++++
fs/resctrl/rdtgroup.c | 17 +++++++++++++++++
2 files changed, 34 insertions(+)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index 3d66814a1d7f..1de55b5cb0e3 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -30,6 +30,7 @@ ABMC (Assignable Bandwidth Monitoring Counters) ""
SDCIAE (Smart Data Cache Injection Allocation Enforcement) ""
GMBA (Global Memory Bandwidth Allocation) ""
GSMBA (Global Slow Memory Bandwidth Allocation) ""
+PLZA (Privilege Level Zero Association) ""
=============================================================== ================================
Historically, new features were made visible by default in /proc/cpuinfo. This
@@ -151,6 +152,22 @@ related to allocation:
"1":
Non-contiguous 1s value in CBM is supported.
+"plza_capable":
+ Indicates the availability of Privilege Level Zero Association (PLZA).
+ PLZA is a hardware feature that enables automatic association of execution
+ at Privilege Level Zero (CPL=0) with a designated Class of Service
+ Identifier (CLOSID) and/or Resource Monitoring Identifier (RMID).
+ This mechanism allows the system to override the default per-thread
+ association for threads operating at CPL=0 when necessary. Additionally,
+ PLZA provides configuration capabilities for defining a dedicated resource
+ control group and assigning CPUs and tasks to operate under CLOSID
+ constraints reserved exclusively for PLZA.
+
+ "1":
+ Resource supports the feature.
+ "0":
+ Support not available for this resource.
+
"io_alloc":
"io_alloc" enables system software to configure the portion of
the cache allocated for I/O traffic. File may only exist if the
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index fc034f4481e3..d773bf77bcc6 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1260,6 +1260,16 @@ static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
return nbytes;
}
+static int rdt_plza_show(struct kernfs_open_file *of, struct seq_file *seq, void *v)
+{
+ struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
+ struct rdt_resource *r = s->res;
+
+ seq_printf(seq, "%d\n", r->plza_capable);
+
+ return 0;
+}
+
/*
* rdtgroup_mode_show - Display mode of this resource group
*/
@@ -1991,6 +2001,13 @@ static struct rftype res_common_files[] = {
.seq_show = rdt_delay_linear_show,
.fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB,
},
+ {
+ .name = "plza_capable",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_plza_show,
+ .fflags = RFTYPE_CTRL_INFO,
+ },
/*
* Platform specific which (if any) capabilities are provided by
* thread_throttle_mode. Defer "fflags" initialization to platform
--
2.34.1
* [RFC PATCH 11/19] resctrl: Introduce PLZA static key enable/disable helpers
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (9 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 10/19] fs/resctrl: Expose plza_capable via control info file Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 12/19] x86/resctrl: Add data structures and definitions for PLZA configuration Babu Moger
` (8 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
The resctrl subsystem uses static keys to efficiently toggle allocation and
monitoring features at runtime (e.g., rdt_alloc_enable_key,
rdt_mon_enable_key). Privilege-Level Zero Association (PLZA) is a new,
optional capability that should only impact fast paths when enabled.
Introduce a new static key, rdt_plza_enable_key, and wire it up with arch
helpers that mirror the existing alloc/mon pattern. This provides a
lightweight, unified mechanism to guard PLZA-specific paths and to keep the
global resctrl usage count accurate.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/include/asm/resctrl.h | 13 +++++++++++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 ++
fs/resctrl/rdtgroup.c | 4 ++++
3 files changed, 19 insertions(+)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 575f8408a9e7..fc0a7f64649e 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -48,6 +48,7 @@ extern bool rdt_mon_capable;
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_plza_enable_key);
static inline bool resctrl_arch_alloc_capable(void)
{
@@ -83,6 +84,18 @@ static inline void resctrl_arch_disable_mon(void)
static_branch_dec_cpuslocked(&rdt_enable_key);
}
+static inline void resctrl_arch_enable_plza(void)
+{
+ static_branch_enable_cpuslocked(&rdt_plza_enable_key);
+ static_branch_inc_cpuslocked(&rdt_enable_key);
+}
+
+static inline void resctrl_arch_disable_plza(void)
+{
+ static_branch_disable_cpuslocked(&rdt_plza_enable_key);
+ static_branch_dec_cpuslocked(&rdt_enable_key);
+}
+
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 540e1e719d7f..fe530216a6cc 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -38,6 +38,8 @@ DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+DEFINE_STATIC_KEY_FALSE(rdt_plza_enable_key);
+
/*
* This is safe against resctrl_arch_sched_in() called from __switch_to()
* because __switch_to() is executed with interrupts disabled. A local call
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index d773bf77bcc6..616be6633a6d 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2910,6 +2910,8 @@ static int rdt_get_tree(struct fs_context *fc)
resctrl_arch_enable_alloc();
if (resctrl_arch_mon_capable())
resctrl_arch_enable_mon();
+ if (resctrl_arch_get_plza_capable(RDT_RESOURCE_L3))
+ resctrl_arch_enable_plza();
if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable())
resctrl_mounted = true;
@@ -3232,6 +3234,8 @@ static void rdt_kill_sb(struct super_block *sb)
resctrl_arch_disable_alloc();
if (resctrl_arch_mon_capable())
resctrl_arch_disable_mon();
+ if (resctrl_arch_get_plza_capable(RDT_RESOURCE_L3))
+ resctrl_arch_disable_plza();
resctrl_mounted = false;
kernfs_kill_sb(sb);
mutex_unlock(&rdtgroup_mutex);
--
2.34.1
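The enable/disable helpers above take and drop one reference on the shared rdt_enable_key while flipping the PLZA-specific key. That refcounting pattern can be modeled with plain counters; this is a sketch only (the kernel uses the static_branch machinery, and the names below are hypothetical stand-ins):

```c
/* Model of the counted-key pattern: rdt_enable_key is shared by
 * alloc, mon, and plza, so it is incremented/decremented, while the
 * plza key is a simple on/off branch. Plain ints stand in for
 * static keys in this sketch. */
static int rdt_enable_cnt;
static int plza_enabled;

static void model_enable_plza(void)
{
	plza_enabled = 1;	/* static_branch_enable_cpuslocked() */
	rdt_enable_cnt++;	/* static_branch_inc_cpuslocked() */
}

static void model_disable_plza(void)
{
	plza_enabled = 0;	/* static_branch_disable_cpuslocked() */
	rdt_enable_cnt--;	/* static_branch_dec_cpuslocked() */
}
```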
* [RFC PATCH 12/19] x86/resctrl: Add data structures and definitions for PLZA configuration
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (10 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 11/19] resctrl: Introduce PLZA static key enable/disable helpers Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling Babu Moger
` (7 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Privilege Level Zero Association (PLZA) is configured with a Per Logical
Processor MSR: MSR_IA32_PQR_PLZA_ASSOC (0xc00003fc).
Add the necessary data structures and definitions to support PLZA
configuration.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/include/asm/msr-index.h | 7 +++++++
arch/x86/kernel/cpu/resctrl/internal.h | 26 ++++++++++++++++++++++++++
2 files changed, 33 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0ef1f6a8f4bc..d42d31beaf3a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1271,12 +1271,19 @@
/* - AMD: */
#define MSR_IA32_MBA_BW_BASE 0xc0000200
#define MSR_IA32_SMBA_BW_BASE 0xc0000280
+#define MSR_IA32_PQR_PLZA_ASSOC 0xc00003fc
#define MSR_IA32_L3_QOS_ABMC_CFG 0xc00003fd
#define MSR_IA32_L3_QOS_EXT_CFG 0xc00003ff
#define MSR_IA32_EVT_CFG_BASE 0xc0000400
#define MSR_IA32_GMBA_BW_BASE 0xc0000600
#define MSR_IA32_GSMBA_BW_BASE 0xc0000680
+/* Lower 32 bits of MSR_IA32_PQR_PLZA_ASSOC */
+#define RMID_EN BIT(31)
+/* Upper 32 bits of MSR_IA32_PQR_PLZA_ASSOC */
+#define CLOSID_EN BIT(15)
+#define PLZA_EN BIT(31)
+
/* AMD-V MSRs */
#define MSR_VM_CR 0xc0010114
#define MSR_VM_IGNNE 0xc0010115
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 61a283652d39..4ea1ba659a01 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -219,6 +219,32 @@ union l3_qos_abmc_cfg {
unsigned long full;
};
+/*
+ * PLZA can be configured on a CPU by writing to MSR_IA32_PQR_PLZA_ASSOC.
+ *
+ * @rmid : The RMID to be configured for PLZA.
+ * @reserved1 : Reserved.
+ * @rmid_en : Associate the RMID or not.
+ * @closid : The CLOSID to be configured for PLZA.
+ * @reserved2 : Reserved.
+ * @closid_en : Associate the CLOSID or not.
+ * @reserved3 : Reserved.
+ * @plza_en : Configure PLZA or not.
+ */
+union qos_pqr_plza_assoc {
+ struct {
+ unsigned long rmid :12,
+ reserved1 :19,
+ rmid_en : 1,
+ closid : 4,
+ reserved2 :11,
+ closid_en : 1,
+ reserved3 :15,
+ plza_en : 1;
+ } split;
+ unsigned long full;
+};
+
void rdt_ctrl_update(void *arg);
int rdt_get_l3_mon_config(struct rdt_resource *r);
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (11 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 12/19] x86/resctrl: Add data structures and definitions for PLZA configuration Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-27 22:30 ` Luck, Tony
2026-01-21 21:12 ` [RFC PATCH 14/19] x86,fs/resctrl: Add the functionality to configure PLZA Babu Moger
` (6 subsequent siblings)
19 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
The resctrl subsystem writes the task's RMID/CLOSID to IA32_PQR_ASSOC in
__resctrl_sched_in(). With PLZA support being introduced and guarded by
rdt_plza_enable_key, the kernel needs a way to track and program the PLZA
association independently of the regular RMID/CLOSID path.
Extend the per-CPU resctrl_pqr_state to track PLZA-related state, including
the current and default PLZA values along with the associated RMID and
CLOSID.
Update the resctrl scheduling-in path to program the PLZA MSR when PLZA
support is enabled. During the context switch, the task-specific PLZA
setting is applied if present; otherwise, the per-CPU default PLZA value is
used. The MSR is only written when the PLZA state changes, avoiding
unnecessary writes.
PLZA programming is guarded by a static key to ensure there is no overhead
when the feature is disabled.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/include/asm/resctrl.h | 19 +++++++++++++++++++
include/linux/sched.h | 1 +
2 files changed, 20 insertions(+)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index fc0a7f64649e..76de7d6051b7 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -38,6 +38,10 @@ struct resctrl_pqr_state {
u32 cur_closid;
u32 default_rmid;
u32 default_closid;
+ u32 cur_plza;
+ u32 default_plza;
+ u32 plza_rmid;
+ u32 plza_closid;
};
DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
@@ -115,6 +119,7 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
u32 closid = READ_ONCE(state->default_closid);
u32 rmid = READ_ONCE(state->default_rmid);
+ u32 plza = READ_ONCE(state->default_plza);
u32 tmp;
/*
@@ -138,6 +143,20 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
state->cur_rmid = rmid;
wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}
+
+ if (static_branch_likely(&rdt_plza_enable_key)) {
+ tmp = READ_ONCE(tsk->plza);
+ if (tmp)
+ plza = tmp;
+
+ if (plza != state->cur_plza) {
+ state->cur_plza = plza;
+ wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
+ RMID_EN | state->plza_rmid,
+ (plza ? PLZA_EN : 0) | CLOSID_EN | state->plza_closid);
+ }
+ }
+
}
static inline unsigned int resctrl_arch_round_mon_val(unsigned int val)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f3a60f13393..d573163865ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1326,6 +1326,7 @@ struct task_struct {
#ifdef CONFIG_X86_CPU_RESCTRL
u32 closid;
u32 rmid;
+ u32 plza;
#endif
#ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 14/19] x86,fs/resctrl: Add the functionality to configure PLZA
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (12 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-29 19:13 ` Luck, Tony
2026-01-21 21:12 ` [RFC PATCH 15/19] fs/resctrl: Introduce PLZA attribute in rdtgroup interface Babu Moger
` (5 subsequent siblings)
19 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Privilege Level Zero Association (PLZA) is configured by writing to
MSR_IA32_PQR_PLZA_ASSOC. PLZA is disabled by default on all logical
processors in the QOS Domain. System software must follow this sequence:
1. Set the closid, closid_en, rmid and rmid_en fields of
MSR_IA32_PQR_PLZA_ASSOC to the desired configuration on all logical
processors in the QOS Domain.
2. Set MSR_IA32_PQR_PLZA_ASSOC[PLZA_EN]=1 for
all logical processors in the QOS domain where PLZA should be enabled.
MSR_IA32_PQR_PLZA_ASSOC[PLZA_EN] may have a different value on every
logical processor in the QOS domain. The system software should perform
this as a read-modify-write to avoid changing the value of closid_en,
closid, rmid_en, and rmid fields of MSR_IA32_PQR_PLZA_ASSOC.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/include/asm/resctrl.h | 7 +++++++
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 25 +++++++++++++++++++++++
include/linux/resctrl.h | 11 ++++++++++
3 files changed, 43 insertions(+)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 76de7d6051b7..89b38948be1a 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -193,6 +193,13 @@ static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored,
return READ_ONCE(tsk->rmid) == rmid;
}
+static inline void resctrl_arch_set_cpu_plza(int cpu, u32 closid, u32 rmid, u32 enable)
+{
+ WRITE_ONCE(per_cpu(pqr_state.default_plza, cpu), enable);
+ WRITE_ONCE(per_cpu(pqr_state.plza_closid, cpu), closid);
+ WRITE_ONCE(per_cpu(pqr_state.plza_rmid, cpu), rmid);
+}
+
static inline void resctrl_arch_sched_in(struct task_struct *tsk)
{
if (static_branch_likely(&rdt_enable_key))
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b20e705606b8..79ed41bde810 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -131,3 +131,28 @@ int resctrl_arch_io_alloc_enable(struct rdt_resource *r, bool enable)
return 0;
}
+
+static void resctrl_plza_set_one_amd(void *arg)
+{
+ union qos_pqr_plza_assoc *plza = arg;
+
+ wrmsrl(MSR_IA32_PQR_PLZA_ASSOC, plza->full);
+}
+
+void resctrl_arch_plza_setup(struct rdt_resource *r, u32 closid, u32 rmid)
+{
+ union qos_pqr_plza_assoc plza = { 0 };
+ struct rdt_ctrl_domain *d;
+ int cpu;
+
+ plza.split.rmid = rmid;
+ plza.split.rmid_en = 1;
+ plza.split.closid = closid;
+ plza.split.closid_en = 1;
+
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+ for_each_cpu(cpu, &d->hdr.cpu_mask)
+ resctrl_arch_set_cpu_plza(cpu, closid, rmid, 0);
+ on_each_cpu_mask(&d->hdr.cpu_mask, resctrl_plza_set_one_amd, &plza, 1);
+ }
+}
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ae252a0e6d92..ef26253ad24a 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -704,6 +704,17 @@ int resctrl_arch_io_alloc_enable(struct rdt_resource *r, bool enable);
*/
bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r);
+/*
+ * resctrl_arch_plza_setup() - Program the PLZA CLOSID/RMID association
+ * on all CPUs in the resource's control domains.
+ * @r: The resctrl resource.
+ * @closid: The CLOSID to be configured for PLZA.
+ * @rmid: The RMID to be configured for PLZA.
+ *
+ * This can be called from any CPU.
+ */
+void resctrl_arch_plza_setup(struct rdt_resource *r, u32 closid, u32 rmid);
+
extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 15/19] fs/resctrl: Introduce PLZA attribute in rdtgroup interface
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (13 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 14/19] x86,fs/resctrl: Add the functionality to configure PLZA Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group Babu Moger
` (4 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Add plza attribute to display Privilege Level Zero Association for the
resctrl group.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
fs/resctrl/internal.h | 2 ++
fs/resctrl/rdtgroup.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 1a9b29119f88..c107c1328be6 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -215,6 +215,7 @@ struct mongroup {
* monitor only or ctrl_mon group
* @mon: mongroup related data
* @mode: mode of resource group
+ * @plza: Is Privilege Level Zero Association enabled?
* @mba_mbps_event: input monitoring event id when mba_sc is enabled
* @plr: pseudo-locked region
*/
@@ -228,6 +229,7 @@ struct rdtgroup {
enum rdt_group_type type;
struct mongroup mon;
enum rdtgrp_mode mode;
+ bool plza;
enum resctrl_event_id mba_mbps_event;
struct pseudo_lock_region *plr;
};
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 616be6633a6d..d467b52a0c74 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -880,6 +880,22 @@ static int rdtgroup_rmid_show(struct kernfs_open_file *of,
return ret;
}
+static int rdtgroup_plza_show(struct kernfs_open_file *of,
+ struct seq_file *s, void *v)
+{
+ struct rdtgroup *rdtgrp;
+ int ret = 0;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (rdtgrp)
+ seq_printf(s, "%u\n", rdtgrp->plza);
+ else
+ ret = -ENOENT;
+ rdtgroup_kn_unlock(of->kn);
+
+ return ret;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL
/*
* A task can only be part of one resctrl control group and of one monitor
@@ -2153,6 +2169,12 @@ static struct rftype res_common_files[] = {
.seq_show = rdtgroup_closid_show,
.fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG,
},
+ {
+ .name = "plza",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdtgroup_plza_show,
+ },
};
static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
@@ -2251,6 +2273,14 @@ static void io_alloc_init(void)
}
}
+static void resctrl_plza_init(void)
+{
+ struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+
+ if (r->plza_capable)
+ resctrl_file_fflags_init("plza", RFTYPE_CTRL_BASE);
+}
+
void resctrl_file_fflags_init(const char *config, unsigned long fflags)
{
struct rftype *rft;
@@ -4609,6 +4639,8 @@ int resctrl_init(void)
io_alloc_init();
+ resctrl_plza_init();
+
ret = resctrl_l3_mon_resource_init();
if (ret)
return ret;
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (14 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 15/19] fs/resctrl: Introduce PLZA attribute in rdtgroup interface Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-28 22:03 ` Luck, Tony
2026-02-10 0:05 ` Reinette Chatre
2026-01-21 21:12 ` [RFC PATCH 17/19] fs/resctrl: Update PLZA configuration when cpu_mask changes Babu Moger
` (3 subsequent siblings)
19 siblings, 2 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Introduce rdtgroup_plza_write(), which enables per-group control of PLZA
through the resctrl filesystem, and ensure that enabling or disabling
PLZA is propagated consistently across all CPUs belonging to the group.
Enforce the capability checks and exclude the default group, pseudo-locked
groups, and CTRL_MON groups that have sub-monitor groups. Also, ensure that
only one group can have PLZA enabled at a time.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
Documentation/filesystems/resctrl.rst | 5 ++
fs/resctrl/rdtgroup.c | 88 ++++++++++++++++++++++++++-
2 files changed, 92 insertions(+), 1 deletion(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index 1de55b5cb0e3..8edcc047ffe5 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -626,6 +626,11 @@ When control is enabled all CTRL_MON groups will also contain:
Available only with debug option. The identifier used by hardware
for the control group. On x86 this is the CLOSID.
+"plza":
+ When enabled, CPUs or tasks in the resctrl group follow the group's
+ limits while running at Privilege Level 0 (CPL-0). This can only be
+ enabled for CTRL_MON groups.
+
When monitoring is enabled all MON groups will also contain:
"mon_data":
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index d467b52a0c74..042ae7d63aea 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -896,6 +896,76 @@ static int rdtgroup_plza_show(struct kernfs_open_file *of,
return ret;
}
+static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+ struct rdtgroup *rdtgrp, *prgrp;
+ int cpu, ret = 0;
+ bool enable;
+
+ ret = kstrtobool(buf, &enable);
+ if (ret)
+ return ret;
+
+ rdtgrp = rdtgroup_kn_lock_live(of->kn);
+ if (!rdtgrp) {
+ rdtgroup_kn_unlock(of->kn);
+ return -ENOENT;
+ }
+
+ rdt_last_cmd_clear();
+
+ if (!r->plza_capable) {
+ rdt_last_cmd_puts("PLZA is not supported in the system\n");
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (rdtgrp == &rdtgroup_default) {
+ rdt_last_cmd_puts("Cannot set PLZA on a default group\n");
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
+ rdt_last_cmd_puts("Resource group is pseudo-locked\n");
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (!list_empty(&rdtgrp->mon.crdtgrp_list)) {
+ rdt_last_cmd_puts("Cannot change CTRL_MON group with sub monitor groups\n");
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+ if (prgrp == rdtgrp)
+ continue;
+ if (enable && prgrp->plza) {
+ rdt_last_cmd_puts("PLZA is already configured on another group\n");
+ ret = -EINVAL;
+ goto unlock;
+ }
+ }
+
+ /* Enable or disable PLZA state and update per CPU state if there is a change */
+ if (enable != rdtgrp->plza) {
+ resctrl_arch_plza_setup(r, rdtgrp->closid, rdtgrp->mon.rmid);
+
+ for_each_cpu(cpu, &rdtgrp->cpu_mask)
+ resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
+ rdtgrp->mon.rmid, enable);
+ rdtgrp->plza = enable;
+ }
+
+unlock:
+ rdtgroup_kn_unlock(of->kn);
+
+ return ret ?: nbytes;
+}
+
#ifdef CONFIG_PROC_CPU_RESCTRL
/*
* A task can only be part of one resctrl control group and of one monitor
@@ -2171,8 +2241,9 @@ static struct rftype res_common_files[] = {
},
{
.name = "plza",
- .mode = 0444,
+ .mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
+ .write = rdtgroup_plza_write,
.seq_show = rdtgroup_plza_show,
},
};
@@ -3126,11 +3197,19 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
static void rmdir_all_sub(void)
{
struct rdtgroup *rdtgrp, *tmp;
+ int cpu;
/* Move all tasks to the default resource group */
rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
+ if (rdtgrp->plza) {
+ for_each_cpu(cpu, &rdtgrp->cpu_mask)
+ resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
+ rdtgrp->mon.rmid, false);
+ rdtgrp->plza = 0;
+ }
+
/* Free any child rmids */
free_all_child_rdtgrp(rdtgrp);
@@ -4090,6 +4169,13 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
u32 closid, rmid;
int cpu;
+ if (rdtgrp->plza) {
+ for_each_cpu(cpu, &rdtgrp->cpu_mask)
+ resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
+ rdtgrp->mon.rmid, false);
+ rdtgrp->plza = 0;
+ }
+
/* Give any tasks back to the default group */
rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 17/19] fs/resctrl: Update PLZA configuration when cpu_mask changes
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (15 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 18/19] x86/resctrl: Refactor show_rdt_tasks() to support PLZA task matching Babu Moger
` (2 subsequent siblings)
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
When PLZA is active for a resctrl group, per-CPU PLZA state must track CPU
mask changes.
Introduce cpus_ctrl_plza_write() to update PLZA on CPUs entering or leaving
the group.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
fs/resctrl/rdtgroup.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 042ae7d63aea..bea017f9bd40 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -508,6 +508,33 @@ static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask,
return 0;
}
+static int cpus_ctrl_plza_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask,
+ cpumask_var_t tmpmask)
+{
+ int cpu;
+
+ /* Check if the cpus are dropped from the group */
+ cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
+ if (!cpumask_empty(tmpmask)) {
+ for_each_cpu(cpu, tmpmask)
+ resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
+ rdtgrp->mon.rmid, false);
+ }
+
+ /* If the CPUs are added then enable PLZA on the added CPUs. */
+ cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask);
+ if (!cpumask_empty(tmpmask)) {
+ for_each_cpu(cpu, tmpmask)
+ resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
+ rdtgrp->mon.rmid, true);
+ }
+
+ /* Update the group with new mask */
+ cpumask_copy(&rdtgrp->cpu_mask, newmask);
+
+ return 0;
+}
+
static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -563,7 +590,9 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
goto unlock;
}
- if (rdtgrp->type == RDTCTRL_GROUP)
+ if (rdtgrp->plza)
+ ret = cpus_ctrl_plza_write(rdtgrp, newmask, tmpmask);
+ else if (rdtgrp->type == RDTCTRL_GROUP)
ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1);
else if (rdtgrp->type == RDTMON_GROUP)
ret = cpus_mon_write(rdtgrp, newmask, tmpmask);
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 18/19] x86/resctrl: Refactor show_rdt_tasks() to support PLZA task matching
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (16 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 17/19] fs/resctrl: Update PLZA configuration when cpu_mask changes Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 19/19] fs/resctrl: Add per-task PLZA enable support via rdtgroup Babu Moger
2026-02-03 19:58 ` [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Luck, Tony
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Refactor show_rdt_tasks() to use a new rdt_task_match() helper that checks
t->plza when PLZA is enabled for a group, falling back to CLOSID/RMID
matching otherwise. This ensures correct task display for PLZA-enabled
groups.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
fs/resctrl/rdtgroup.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index bea017f9bd40..a116daa53f17 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -845,6 +845,15 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
return ret ?: nbytes;
}
+static inline bool rdt_task_match(struct task_struct *t,
+ struct rdtgroup *r, bool plza)
+{
+ if (plza)
+ return t->plza;
+
+ return is_closid_match(t, r) || is_rmid_match(t, r);
+}
+
static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
{
struct task_struct *p, *t;
@@ -852,11 +861,12 @@ static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
rcu_read_lock();
for_each_process_thread(p, t) {
- if (is_closid_match(t, r) || is_rmid_match(t, r)) {
- pid = task_pid_vnr(t);
- if (pid)
- seq_printf(s, "%d\n", pid);
- }
+ if (!rdt_task_match(t, r, r->plza))
+ continue;
+
+ pid = task_pid_vnr(t);
+ if (pid)
+ seq_printf(s, "%d\n", pid);
}
rcu_read_unlock();
}
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [RFC PATCH 19/19] fs/resctrl: Add per-task PLZA enable support via rdtgroup
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (17 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 18/19] x86/resctrl: Refactor show_rdt_tasks() to support PLZA task matching Babu Moger
@ 2026-01-21 21:12 ` Babu Moger
2026-02-03 19:58 ` [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Luck, Tony
19 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-21 21:12 UTC (permalink / raw)
To: corbet, tony.luck, reinette.chatre, Dave.Martin, james.morse,
babu.moger, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Introduce support for enabling PLZA on a per-task basis through the resctrl
control-group interface.
Add an architecture helper to set the PLZA state in the task structure and
extend the rdtgroup task handling path to apply PLZA when associating a
task with a control group. PLZA can only be enabled for control groups;
attempts to enable it for monitoring groups are rejected.
Proper memory ordering is enforced to ensure that the task's PLZA update is
visible before determining whether the task is currently running. If the
task is active on a CPU, the relevant MSRs are updated immediately;
otherwise, the PLZA state is programmed on the next context switch.
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/include/asm/resctrl.h | 5 +++
fs/resctrl/rdtgroup.c | 69 +++++++++++++++++++++++++++++++++-
2 files changed, 73 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 89b38948be1a..2c11787c5253 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -200,6 +200,11 @@ static inline void resctrl_arch_set_cpu_plza(int cpu, u32 closid, u32 rmid, u32
WRITE_ONCE(per_cpu(pqr_state.plza_rmid, cpu), rmid);
}
+static inline void resctrl_arch_set_task_plza(struct task_struct *tsk, u32 enable)
+{
+ WRITE_ONCE(tsk->plza, enable);
+}
+
static inline void resctrl_arch_sched_in(struct task_struct *tsk)
{
if (static_branch_likely(&rdt_enable_key))
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a116daa53f17..5ec10f07cbf2 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -706,6 +706,26 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
return 0;
}
+static int __rdtgroup_plza_task(struct task_struct *tsk,
+ struct rdtgroup *rdtgrp)
+{
+ if (rdtgrp->type != RDTCTRL_GROUP) {
+ rdt_last_cmd_puts("Can't set PLZA on MON group\n");
+ return -EINVAL;
+ }
+
+ resctrl_arch_set_task_plza(tsk, 1);
+
+ /*
+ * Order the task's plza state stores above before the loads in
+ * task_curr(). This pairs with the full barrier between the
+ * rq->curr update and resctrl_arch_sched_in() during context switch.
+ */
+ smp_mb();
+
+ return 0;
+}
+
static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
{
return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) &&
@@ -795,6 +815,35 @@ static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
return ret;
}
+static int rdtgroup_plza_task(pid_t pid, struct rdtgroup *rdtgrp,
+ struct kernfs_open_file *of)
+{
+ struct task_struct *tsk;
+ int ret;
+
+ rcu_read_lock();
+ if (pid) {
+ tsk = find_task_by_vpid(pid);
+ if (!tsk) {
+ rcu_read_unlock();
+ rdt_last_cmd_printf("No task %d\n", pid);
+ return -ESRCH;
+ }
+ } else {
+ tsk = current;
+ }
+
+ get_task_struct(tsk);
+ rcu_read_unlock();
+
+ ret = rdtgroup_task_write_permission(tsk, of);
+ if (!ret)
+ ret = __rdtgroup_plza_task(tsk, rdtgrp);
+
+ put_task_struct(tsk);
+ return ret;
+}
+
static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -832,7 +881,10 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
break;
}
- ret = rdtgroup_move_task(pid, rdtgrp, of);
+ if (rdtgrp->plza)
+ ret = rdtgroup_plza_task(pid, rdtgrp, of);
+ else
+ ret = rdtgroup_move_task(pid, rdtgrp, of);
if (ret) {
rdt_last_cmd_printf("Error while processing task %d\n", pid);
break;
@@ -935,6 +987,19 @@ static int rdtgroup_plza_show(struct kernfs_open_file *of,
return ret;
}
+static void rdt_task_set_plza(struct rdtgroup *r, bool plza)
+{
+ struct task_struct *p, *t;
+
+ rcu_read_lock();
+ for_each_process_thread(p, t) {
+ if (!rdt_task_match(t, r, r->plza))
+ continue;
+ resctrl_arch_set_task_plza(t, plza);
+ }
+ rcu_read_unlock();
+}
+
static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -991,6 +1056,7 @@ static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
/* Enable or disable PLZA state and update per CPU state if there is a change */
if (enable != rdtgrp->plza) {
+ rdt_task_set_plza(rdtgrp, enable);
resctrl_arch_plza_setup(r, rdtgrp->closid, rdtgrp->mon.rmid);
for_each_cpu(cpu, &rdtgrp->cpu_mask)
@@ -4209,6 +4275,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
int cpu;
if (rdtgrp->plza) {
+ rdt_task_set_plza(rdtgrp, false);
for_each_cpu(cpu, &rdtgrp->cpu_mask)
resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
rdtgrp->mon.rmid, false);
--
2.34.1
^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-21 21:12 ` [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling Babu Moger
@ 2026-01-27 22:30 ` Luck, Tony
2026-01-28 16:01 ` Moger, Babu
0 siblings, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-01-27 22:30 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
On Wed, Jan 21, 2026 at 03:12:51PM -0600, Babu Moger wrote:
> @@ -138,6 +143,20 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
> state->cur_rmid = rmid;
> wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
> }
> +
> + if (static_branch_likely(&rdt_plza_enable_key)) {
> + tmp = READ_ONCE(tsk->plza);
> + if (tmp)
> + plza = tmp;
> +
> + if (plza != state->cur_plza) {
> + state->cur_plza = plza;
> + wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
> + RMID_EN | state->plza_rmid,
> + (plza ? PLZA_EN : 0) | CLOSID_EN | state->plza_closid);
> + }
> + }
> +
Babu,
This addition to the context switch code surprised me. After your talk
at LPC I had imagined that PLZA would be a single global setting so that
every syscall/page-fault/interrupt would run with a different CLOSID
(presumably one configured with more cache and memory bandwidth).
But this patch series looks like things are more flexible with the
ability to set different values (of RMID as well as CLOSID) per group.
It looks like it is possible to have some resctrl group with very
limited resources just bump up a bit when in ring0, while other
groups may get some different amount.
The additions for plza to the Documentation aren't helping me
understand how users will apply this.
Do you have some more examples?
-Tony
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-27 22:30 ` Luck, Tony
@ 2026-01-28 16:01 ` Moger, Babu
2026-01-28 17:12 ` Luck, Tony
2026-02-12 10:00 ` Ben Horgan
0 siblings, 2 replies; 114+ messages in thread
From: Moger, Babu @ 2026-01-28 16:01 UTC (permalink / raw)
To: Luck, Tony, Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
Hi Tony,
Thanks for the comment.
On 1/27/2026 4:30 PM, Luck, Tony wrote:
> On Wed, Jan 21, 2026 at 03:12:51PM -0600, Babu Moger wrote:
>> @@ -138,6 +143,20 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
>> state->cur_rmid = rmid;
>> wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
>> }
>> +
>> + if (static_branch_likely(&rdt_plza_enable_key)) {
>> + tmp = READ_ONCE(tsk->plza);
>> + if (tmp)
>> + plza = tmp;
>> +
>> + if (plza != state->cur_plza) {
>> + state->cur_plza = plza;
>> + wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
>> + RMID_EN | state->plza_rmid,
>> + (plza ? PLZA_EN : 0) | CLOSID_EN | state->plza_closid);
>> + }
>> + }
>> +
>
> Babu,
>
> This addition to the context switch code surprised me. After your talk
> at LPC I had imagined that PLZA would be a single global setting so that
> every syscall/page-fault/interrupt would run with a different CLOSID
> (presumably one configured with more cache and memory bandwidth).
>
> But this patch series looks like things are more flexible with the
> ability to set different values (of RMID as well as CLOSID) per group.
Yes, this is similar to what we have with MSR_IA32_PQR_ASSOC. The
association can be done either through CPUs (just one MSR write) or through
task-based association (more MSR writes as the task moves around).
>
> It looks like it is possible to have some resctrl group with very
> limited resources just bump up a bit when in ring0, while other
> groups may get some different amount.
>
> The additions for plza to the Documentation aren't helping me
> understand how users will apply this.
>
> Do you have some more examples?
Group creation is similar to what we have currently.
1. Create a regular group and set up the limits.
# mkdir /sys/fs/resctrl/group
2. Assign tasks or CPUs.
# echo 1234 > /sys/fs/resctrl/group/tasks
This is a regular group.
3. Now suppose you find that you need to change things in CPL0 for this task.
4. Create a PLZA group and tweak the limits:
# mkdir /sys/fs/resctrl/group1
# echo 1 > /sys/fs/resctrl/group1/plza
# echo "MB:0=100" > /sys/fs/resctrl/group1/schemata
5. Assign the same task to the plza group.
# echo 1234 > /sys/fs/resctrl/group1/tasks
Now task 1234 will use the limits from group1 when running in
CPL0.
I will add a few more details in my next revision.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-28 16:01 ` Moger, Babu
@ 2026-01-28 17:12 ` Luck, Tony
2026-01-28 17:41 ` Moger, Babu
2026-02-12 10:00 ` Ben Horgan
1 sibling, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-01-28 17:12 UTC (permalink / raw)
To: Moger, Babu
Cc: Babu Moger, corbet, reinette.chatre, Dave.Martin, james.morse,
tglx, mingo, bp, dave.hansen, x86, hpa, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, akpm, pawan.kumar.gupta, pmladek, feng.tang, kees, arnd,
fvdl, lirongqing, bhelgaas, seanjc, xin, manali.shukla,
dapeng1.mi, chang.seok.bae, mario.limonciello, naveen,
elena.reshetova, thomas.lendacky, linux-doc, linux-kernel, kvm,
peternewman, eranian, gautham.shenoy
On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> Thanks for the comment.
>
> On 1/27/2026 4:30 PM, Luck, Tony wrote:
> > On Wed, Jan 21, 2026 at 03:12:51PM -0600, Babu Moger wrote:
> > > @@ -138,6 +143,20 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
> > > state->cur_rmid = rmid;
> > > wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
> > > }
> > > +
> > > + if (static_branch_likely(&rdt_plza_enable_key)) {
> > > + tmp = READ_ONCE(tsk->plza);
> > > + if (tmp)
> > > + plza = tmp;
> > > +
> > > + if (plza != state->cur_plza) {
> > > + state->cur_plza = plza;
> > > + wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
> > > + RMID_EN | state->plza_rmid,
> > > + (plza ? PLZA_EN : 0) | CLOSID_EN | state->plza_closid);
> > > + }
> > > + }
> > > +
> >
> > Babu,
> >
> > This addition to the context switch code surprised me. After your talk
> > at LPC I had imagined that PLZA would be a single global setting so that
> > every syscall/page-fault/interrupt would run with a different CLOSID
> > (presumably one configured with more cache and memory bandwidth).
> >
> > But this patch series looks like things are more flexible with the
> > ability to set different values (of RMID as well as CLOSID) per group.
>
> Yes. this similar what we have with MSR_IA32_PQR_ASSOC. The association can
> be done either thru CPUs (just one MSR write) or task based association(more
> MSR write as task moves around).
> >
> > It looks like it is possible to have some resctrl group with very
> > limited resources just bump up a bit when in ring0, while other
> > groups may get some different amount.
> >
> > The additions for plza to the Documentation aren't helping me
> > understand how users will apply this.
> >
> > Do you have some more examples?
>
> Group creation is similar to what we have currently.
>
> 1. create a regular group and setup the limits.
> # mkdir /sys/fs/resctrl/group
>
> 2. Assign tasks or CPUs.
> # echo 1234 > /sys/fs/resctrl/group/tasks
>
> This is a regular group.
>
> 3. Now you figured that you need to change things in CPL0 for this task.
>
> 4. Now create a PLZA group now and tweek the limits,
>
> # mkdir /sys/fs/resctrl/group1
>
> # echo 1 > /sys/fs/resctrl/group1/plza
>
> # echo "MB:0=100" > /sys/fs/resctrl/group1/schemata
>
> 5. Assign the same task to the plza group.
>
> # echo 1234 > /sys/fs/resctrl/group1/tasks
>
>
> Now the task 1234 will be using the limits from group1 when running in CPL0.
>
> I will add few more details in my next revision.
>
Babu,
I've read a bit more of the code now and I think I understand more.
Some useful additions to your explanation.
1) Only one CTRL group can be marked as PLZA
2) It can't be the root/default group
3) It can't have sub monitor groups
4) It can't be pseudo-locked
Would a potential use case involve putting *all* tasks into the PLZA
group? That would avoid any additional context switch overhead as the
PLZA MSR would never need to change.
If that is the case, maybe for the PLZA group we should allow user to
do:
# echo '*' > tasks
-Tony
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-28 17:12 ` Luck, Tony
@ 2026-01-28 17:41 ` Moger, Babu
2026-01-28 17:44 ` Moger, Babu
0 siblings, 1 reply; 114+ messages in thread
From: Moger, Babu @ 2026-01-28 17:41 UTC (permalink / raw)
To: Luck, Tony, Moger, Babu
Cc: corbet@lwn.net, reinette.chatre@intel.com, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Luck, Tony <tony.luck@intel.com>
> Sent: Wednesday, January 28, 2026 11:12 AM
> To: Moger, Babu <bmoger@amd.com>
> Cc: Babu Moger <babu.moger@amd.com>; corbet@lwn.net;
> reinette.chatre@intel.com; Dave.Martin@arm.com; james.morse@arm.com;
> tglx@kernel.org; mingo@redhat.com; bp@alien8.de; dave.hansen@linux.intel.com;
> x86@kernel.org; hpa@zytor.com; peterz@infradead.org; juri.lelli@redhat.com;
> vincent.guittot@linaro.org; dietmar.eggemann@arm.com; rostedt@goodmis.org;
> bsegall@google.com; mgorman@suse.de; vschneid@redhat.com; akpm@linux-
> foundation.org; pawan.kumar.gupta@linux.intel.com; pmladek@suse.com;
> feng.tang@linux.alibaba.com; kees@kernel.org; arnd@arndb.de; fvdl@google.com;
> lirongqing@baidu.com; bhelgaas@google.com; seanjc@google.com;
> xin@zytor.com; manali.shukla@amd.com; dapeng1.mi@linux.intel.com;
> chang.seok.bae@intel.com; mario.limonciello@amd.com; naveen@kernel.org;
> elena.reshetova@intel.com; thomas.lendacky@amd.com; linux-
> doc@vger.kernel.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> peternewman@google.com; eranian@google.com; gautham.shenoy@amd.com
> Subject: Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context
> switch handling
>
> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> > Hi Tony,
> >
> > Thanks for the comment.
> >
> > On 1/27/2026 4:30 PM, Luck, Tony wrote:
> > > On Wed, Jan 21, 2026 at 03:12:51PM -0600, Babu Moger wrote:
> > > > @@ -138,6 +143,20 @@ static inline void __resctrl_sched_in(struct
> task_struct *tsk)
> > > > state->cur_rmid = rmid;
> > > > wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
> > > > }
> > > > +
> > > > + if (static_branch_likely(&rdt_plza_enable_key)) {
> > > > + tmp = READ_ONCE(tsk->plza);
> > > > + if (tmp)
> > > > + plza = tmp;
> > > > +
> > > > + if (plza != state->cur_plza) {
> > > > + state->cur_plza = plza;
> > > > + wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
> > > > + RMID_EN | state->plza_rmid,
> > > > + (plza ? PLZA_EN : 0) | CLOSID_EN | state-
> >plza_closid);
> > > > + }
> > > > + }
> > > > +
> > >
> > > Babu,
> > >
> > > This addition to the context switch code surprised me. After your
> > > talk at LPC I had imagined that PLZA would be a single global
> > > setting so that every syscall/page-fault/interrupt would run with a
> > > different CLOSID (presumably one configured with more cache and memory
> bandwidth).
> > >
> > > But this patch series looks like things are more flexible with the
> > > ability to set different values (of RMID as well as CLOSID) per group.
> >
> > Yes. this similar what we have with MSR_IA32_PQR_ASSOC. The
> > association can be done either thru CPUs (just one MSR write) or task
> > based association(more MSR write as task moves around).
> > >
> > > It looks like it is possible to have some resctrl group with very
> > > limited resources just bump up a bit when in ring0, while other
> > > groups may get some different amount.
> > >
> > > The additions for plza to the Documentation aren't helping me
> > > understand how users will apply this.
> > >
> > > Do you have some more examples?
> >
> > Group creation is similar to what we have currently.
> >
> > 1. create a regular group and setup the limits.
> > # mkdir /sys/fs/resctrl/group
> >
> > 2. Assign tasks or CPUs.
> > # echo 1234 > /sys/fs/resctrl/group/tasks
> >
> > This is a regular group.
> >
> > 3. Now you figured that you need to change things in CPL0 for this task.
> >
> > 4. Now create a PLZA group now and tweek the limits,
> >
> > # mkdir /sys/fs/resctrl/group1
> >
> > # echo 1 > /sys/fs/resctrl/group1/plza
> >
> > # echo "MB:0=100" > /sys/fs/resctrl/group1/schemata
> >
> > 5. Assign the same task to the plza group.
> >
> > # echo 1234 > /sys/fs/resctrl/group1/tasks
> >
> >
> > Now the task 1234 will be using the limits from group1 when running in CPL0.
> >
> > I will add few more details in my next revision.
> >
>
> Babu,
>
> I've read a bit more of the code now and I think I understand more.
>
> Some useful additions to your explanation.
>
> 1) Only one CTRL group can be marked as PLZA
Yes. Correct.
> 2) It can't be the root/default group
This is something I added to keep the default group undisturbed.
> 3) It can't have sub monitor groups
> 4) It can't be pseudo-locked
Yes.
>
> Would a potential use case involve putting *all* tasks into the PLZA group? That
> would avoid any additional context switch overhead as the PLZA MSR would never
> need to change.
Yes. That can be one use case.
>
> If that is the case, maybe for the PLZA group we should allow user to
> do:
>
> # echo '*' > tasks
Yes, it can be done.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: RE: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-28 17:41 ` Moger, Babu
@ 2026-01-28 17:44 ` Moger, Babu
2026-01-28 19:17 ` Luck, Tony
2026-02-10 16:17 ` Reinette Chatre
0 siblings, 2 replies; 114+ messages in thread
From: Moger, Babu @ 2026-01-28 17:44 UTC (permalink / raw)
To: Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, reinette.chatre@intel.com, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
On 1/28/2026 11:41 AM, Moger, Babu wrote:
>
Outlook added the AMD header; removed it now.
>> -----Original Message-----
>> From: Luck, Tony <tony.luck@intel.com>
>> Sent: Wednesday, January 28, 2026 11:12 AM
>> To: Moger, Babu <bmoger@amd.com>
>> Cc: Babu Moger <babu.moger@amd.com>; corbet@lwn.net;
>> reinette.chatre@intel.com; Dave.Martin@arm.com; james.morse@arm.com;
>> tglx@kernel.org; mingo@redhat.com; bp@alien8.de; dave.hansen@linux.intel.com;
>> x86@kernel.org; hpa@zytor.com; peterz@infradead.org; juri.lelli@redhat.com;
>> vincent.guittot@linaro.org; dietmar.eggemann@arm.com; rostedt@goodmis.org;
>> bsegall@google.com; mgorman@suse.de; vschneid@redhat.com; akpm@linux-
>> foundation.org; pawan.kumar.gupta@linux.intel.com; pmladek@suse.com;
>> feng.tang@linux.alibaba.com; kees@kernel.org; arnd@arndb.de; fvdl@google.com;
>> lirongqing@baidu.com; bhelgaas@google.com; seanjc@google.com;
>> xin@zytor.com; manali.shukla@amd.com; dapeng1.mi@linux.intel.com;
>> chang.seok.bae@intel.com; mario.limonciello@amd.com; naveen@kernel.org;
>> elena.reshetova@intel.com; thomas.lendacky@amd.com; linux-
>> doc@vger.kernel.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
>> peternewman@google.com; eranian@google.com; gautham.shenoy@amd.com
>> Subject: Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context
>> switch handling
>>
>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>> Hi Tony,
>>>
>>> Thanks for the comment.
>>>
>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>> On Wed, Jan 21, 2026 at 03:12:51PM -0600, Babu Moger wrote:
>>>>> @@ -138,6 +143,20 @@ static inline void __resctrl_sched_in(struct
>> task_struct *tsk)
>>>>> state->cur_rmid = rmid;
>>>>> wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
>>>>> }
>>>>> +
>>>>> + if (static_branch_likely(&rdt_plza_enable_key)) {
>>>>> + tmp = READ_ONCE(tsk->plza);
>>>>> + if (tmp)
>>>>> + plza = tmp;
>>>>> +
>>>>> + if (plza != state->cur_plza) {
>>>>> + state->cur_plza = plza;
>>>>> + wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
>>>>> + RMID_EN | state->plza_rmid,
>>>>> + (plza ? PLZA_EN : 0) | CLOSID_EN | state-
>>> plza_closid);
>>>>> + }
>>>>> + }
>>>>> +
>>>>
>>>> Babu,
>>>>
>>>> This addition to the context switch code surprised me. After your
>>>> talk at LPC I had imagined that PLZA would be a single global
>>>> setting so that every syscall/page-fault/interrupt would run with a
>>>> different CLOSID (presumably one configured with more cache and memory
>> bandwidth).
>>>>
>>>> But this patch series looks like things are more flexible with the
>>>> ability to set different values (of RMID as well as CLOSID) per group.
>>>
>>> Yes. this similar what we have with MSR_IA32_PQR_ASSOC. The
>>> association can be done either thru CPUs (just one MSR write) or task
>>> based association(more MSR write as task moves around).
>>>>
>>>> It looks like it is possible to have some resctrl group with very
>>>> limited resources just bump up a bit when in ring0, while other
>>>> groups may get some different amount.
>>>>
>>>> The additions for plza to the Documentation aren't helping me
>>>> understand how users will apply this.
>>>>
>>>> Do you have some more examples?
>>>
>>> Group creation is similar to what we have currently.
>>>
>>> 1. create a regular group and setup the limits.
>>> # mkdir /sys/fs/resctrl/group
>>>
>>> 2. Assign tasks or CPUs.
>>> # echo 1234 > /sys/fs/resctrl/group/tasks
>>>
>>> This is a regular group.
>>>
>>> 3. Now you figured that you need to change things in CPL0 for this task.
>>>
>>> 4. Now create a PLZA group now and tweek the limits,
>>>
>>> # mkdir /sys/fs/resctrl/group1
>>>
>>> # echo 1 > /sys/fs/resctrl/group1/plza
>>>
>>> # echo "MB:0=100" > /sys/fs/resctrl/group1/schemata
>>>
>>> 5. Assign the same task to the plza group.
>>>
>>> # echo 1234 > /sys/fs/resctrl/group1/tasks
>>>
>>>
>>> Now the task 1234 will be using the limits from group1 when running in CPL0.
>>>
>>> I will add few more details in my next revision.
>>>
>>
>> Babu,
>>
>> I've read a bit more of the code now and I think I understand more.
>>
>> Some useful additions to your explanation.
>>
>> 1) Only one CTRL group can be marked as PLZA
>
> Yes. Correct.
>
>> 2) It can't be the root/default group
>
> This is something I added to keep the default group in a un-disturbed,
>
>> 3) It can't have sub monitor groups
>> 4) It can't be pseudo-locked
>
> Yes.
>
>>
>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>> would avoid any additional context switch overhead as the PLZA MSR would never
>> need to change.
>
> Yes. That can be one use case.
>
>>
>> If that is the case, maybe for the PLZA group we should allow user to
>> do:
>>
>> # echo '*' > tasks
>
> Yea. It can be done.
>
> Thanks
> Babu
>
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: RE: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-28 17:44 ` Moger, Babu
@ 2026-01-28 19:17 ` Luck, Tony
2026-02-10 16:17 ` Reinette Chatre
1 sibling, 0 replies; 114+ messages in thread
From: Luck, Tony @ 2026-01-28 19:17 UTC (permalink / raw)
To: Moger, Babu
Cc: Moger, Babu, corbet@lwn.net, reinette.chatre@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, peterz@infradead.org,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
mgorman@suse.de, vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
> > > Would a potential use case involve putting *all* tasks into the PLZA group? That
> > > would avoid any additional context switch overhead as the PLZA MSR would never
> > > need to change.
> >
> > Yes. That can be one use case.
Are there other use cases? I think the prime motivation for PLZA is to
avoid priority inversions where some task is running with restricted
resources, and does a system call, or takes an interrupt, and the kernel
is stuck with those limited resources - which may delay context switch
to a high priority process.
That thinking leads to "run all ring 0 code with more resources".
Do you see use cases where you'd like to see the low priority tasks
bumped up to some higher level of resources (but perhaps not full
access), while medium priority tasks keep the same resources when
entering ring0?
I think there is some slight oddity if a resctrl group that already
has some tasks is made into the PLZA group.
The existing tasks get marked as PLZA enabled, so they run with the
same resources in ring0 and ring3. But you can't add a new task to this
group with that same property. A newly added task retains its ring3
resources by remaining in its original group. It only gets the PLZA
treatment when transitioning to ring0.
-Tony
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group
2026-01-21 21:12 ` [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group Babu Moger
@ 2026-01-28 22:03 ` Luck, Tony
2026-01-29 18:54 ` Luck, Tony
2026-01-29 19:42 ` Babu Moger
2026-02-10 0:05 ` Reinette Chatre
1 sibling, 2 replies; 114+ messages in thread
From: Luck, Tony @ 2026-01-28 22:03 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
On Wed, Jan 21, 2026 at 03:12:54PM -0600, Babu Moger wrote:
> Introduce rdtgroup_plza_write() group which enables per group control of
> PLZA through the resctrl filesystem and ensure that enabling or disabling
> PLZA is propagated consistently across all CPUs belonging to the group.
>
> Enforce the capability checks, exclude default, pseudo-locked and CTRL_MON
> groups with sub monitors. Also, ensure that only one group can have PLZA
> enabled at a time.
>
...
> +static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off)
> +{
> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> + struct rdtgroup *rdtgrp, *prgrp;
> + int cpu, ret = 0;
> + bool enable;
...
> + /* Enable or disable PLZA state and update per CPU state if there is a change */
> + if (enable != rdtgrp->plza) {
> + resctrl_arch_plza_setup(r, rdtgrp->closid, rdtgrp->mon.rmid);
What is this for? If I've just created a group with no tasks, and empty
CPU mask ... it seems that this writes the MSR_IA32_PQR_PLZA_ASSOC on
every CPU in every domain.
> + for_each_cpu(cpu, &rdtgrp->cpu_mask)
> + resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
> + rdtgrp->mon.rmid, enable);
> + rdtgrp->plza = enable;
> + }
> +
> +unlock:
> + rdtgroup_kn_unlock(of->kn);
> +
> + return ret ?: nbytes;
> +}
It also appears that marking a task as PLZA is permanent. Moving it to
another group doesn't unmark it. Is this intentional?
# mkdir group1 group2 plza_group
# echo 1 > plza_group/plza
# echo $$ > group1/tasks
# echo $$ > plza_group/tasks
My shell is now in group1 and in the plza_group
# grep $$ */tasks
group1/tasks:4125
plza_group/tasks:4125
Move shell to group2
# echo $$ > group2/tasks
# grep $$ */tasks
group2/tasks:4125
plza_group/tasks:4125
Success in moving to group2, but still in plza_group.
-Tony
N.B. I don't have a PLZA enabled system. So I faked it with this
patch.
From 1655fea0049947218fa5400916d57109be8521ef Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Wed, 28 Jan 2026 13:02:51 -0800
Subject: [PATCH] fake PLZA
---
arch/x86/include/asm/resctrl.h | 10 ++++++----
arch/x86/kernel/cpu/resctrl/core.c | 4 ++--
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 3 ++-
3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 2c11787c5253..7ee35bebb64c 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -90,14 +90,16 @@ static inline void resctrl_arch_disable_mon(void)
static inline void resctrl_arch_enable_plza(void)
{
- static_branch_enable_cpuslocked(&rdt_plza_enable_key);
- static_branch_inc_cpuslocked(&rdt_enable_key);
+ pr_info("resctrl_arch_enable_plza\n");
+ //static_branch_enable_cpuslocked(&rdt_plza_enable_key);
+ //static_branch_inc_cpuslocked(&rdt_enable_key);
}
static inline void resctrl_arch_disable_plza(void)
{
- static_branch_disable_cpuslocked(&rdt_plza_enable_key);
- static_branch_dec_cpuslocked(&rdt_enable_key);
+ pr_info("resctrl_arch_disable_plza\n");
+ //static_branch_disable_cpuslocked(&rdt_plza_enable_key);
+ //static_branch_dec_cpuslocked(&rdt_enable_key);
}
/*
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e41fe5fa3f30..780cdfb0e7cd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -295,7 +295,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
r->alloc_capable = true;
- if (rdt_cpu_has(X86_FEATURE_PLZA))
+ if (1 || rdt_cpu_has(X86_FEATURE_PLZA))
r->plza_capable = true;
return true;
@@ -318,7 +318,7 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
r->alloc_capable = true;
- if (rdt_cpu_has(X86_FEATURE_PLZA))
+ if (1 || rdt_cpu_has(X86_FEATURE_PLZA))
r->plza_capable = true;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 79ed41bde810..24a37ebed13a 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -136,7 +136,8 @@ static void resctrl_plza_set_one_amd(void *arg)
{
union qos_pqr_plza_assoc *plza = arg;
- wrmsrl(MSR_IA32_PQR_PLZA_ASSOC, plza->full);
+ pr_info("wrmsr(MSR_IA32_PQR_PLZA_ASSOC, 0x%lx)\n", plza->full);
+ //wrmsrl(MSR_IA32_PQR_PLZA_ASSOC, plza->full);
}
void resctrl_arch_plza_setup(struct rdt_resource *r, u32 closid, u32 rmid)
--
2.52.0
^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group
2026-01-28 22:03 ` Luck, Tony
@ 2026-01-29 18:54 ` Luck, Tony
2026-01-29 19:31 ` Babu Moger
2026-01-29 19:42 ` Babu Moger
1 sibling, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-01-29 18:54 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
On Wed, Jan 28, 2026 at 02:03:31PM -0800, Luck, Tony wrote:
> On Wed, Jan 21, 2026 at 03:12:54PM -0600, Babu Moger wrote:
> > Introduce rdtgroup_plza_write() group which enables per group control of
> > PLZA through the resctrl filesystem and ensure that enabling or disabling
> > PLZA is propagated consistently across all CPUs belonging to the group.
> >
> > Enforce the capability checks, exclude default, pseudo-locked and CTRL_MON
> > groups with sub monitors. Also, ensure that only one group can have PLZA
> > enabled at a time.
> >
> ...
>
> > +static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
> > + size_t nbytes, loff_t off)
> > +{
> > + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> > + struct rdtgroup *rdtgrp, *prgrp;
> > + int cpu, ret = 0;
> > + bool enable;
>
> ...
>
> > + /* Enable or disable PLZA state and update per CPU state if there is a change */
> > + if (enable != rdtgrp->plza) {
> > + resctrl_arch_plza_setup(r, rdtgrp->closid, rdtgrp->mon.rmid);
>
> What is this for? If I've just created a group with no tasks, and empty
> CPU mask ... it seems that this writes the MSR_IA32_PQR_PLZA_ASSOC on
> every CPU in every domain.
I think I see now. There are THREE enable bits in your
MSR_IA32_PQR_PLZA_ASSOC.
One each for CLOSID and RMID, and an overall PLZA_EN in the high bit.
At this step you setup the CLOSID/RMID with their enable bits, but
leaving the PLZA_EN off.
Is this a subtle optimization for the context switch? Is the WRMSR
faster if it only toggles PLZA_EN, leaving all the other bits unchanged?
This might not be working as expected. The context switch code does:
wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
RMID_EN | state->plza_rmid,
(plza ? PLZA_EN : 0) | CLOSID_EN | state->plza_closid);
This doesn't just clear the PLZA_EN bit, it zeroes the high dword of the MSR.
> It also appears that marking a task as PLZA is permanent. Moving it to
> another group doesn't unmark it. Is this intentional?
Ditto assigning a CPU to the PLZA group. Once done it can't be undone
(except by turning off PLZA?).
-Tony
[More comments about this coming against patch 16]
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 14/19] x86,fs/resctrl: Add the functionality to configure PLZA
2026-01-21 21:12 ` [RFC PATCH 14/19] x86,fs/resctrl: Add the functionality to configure PLZA Babu Moger
@ 2026-01-29 19:13 ` Luck, Tony
2026-01-29 19:53 ` Babu Moger
0 siblings, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-01-29 19:13 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
On Wed, Jan 21, 2026 at 03:12:52PM -0600, Babu Moger wrote:
> Privilege Level Zero Association (PLZA) is configured by writing to
> MSR_IA32_PQR_PLZA_ASSOC. PLZA is disabled by default on all logical
> processors in the QOS Domain. System software must follow the following
> sequence.
>
> 1. Set the closid, closid_en, rmid and rmid_en fields of
> MSR_IA32_PQR_PLZA_ASSOC to the desired configuration on all logical
> processors in the QOS Domain.
>
> 2. Set MSR_IA32_PQR_PLZA_ASSOC[PLZA_EN]=1 for
> all logical processors in the QOS domain where PLZA should be enabled.
>
> MSR_IA32_PQR_PLZA_ASSOC[PLZA_EN] may have a different value on every
> logical processor in the QOS domain. The system software should perform
> this as a read-modify-write to avoid changing the value of closid_en,
> closid, rmid_en, and rmid fields of MSR_IA32_PQR_PLZA_ASSOC.
Architecturally this is true. But in the implementation for resctrl
there is only one PLZA group. So the CLOSID and RMID fields are
identical on every logical processor. The only changing bit is the
PLZA_EN.
The code could be simpler if you just maintained a single global
with the CLOSID/RMID bits initialized by resctrl_arch_plza_setup().
union qos_pqr_plza_assoc plza_value; // needs a better name
Change the PLZA_EN define to be
#define PLZA_EN BIT_ULL(63)
and then the hook into the __resctrl_sched_in() becomes:
	if (static_branch_likely(&rdt_plza_enable_key)) {
		u32 plza = READ_ONCE(state->default_plza); // note, moved this inside the static branch

		tmp = READ_ONCE(tsk->plza);
		if (tmp)
			plza = tmp;

		if (plza != state->cur_plza) {
			state->cur_plza = plza;
			wrmsrq(MSR_IA32_PQR_PLZA_ASSOC,
			       (plza ? PLZA_EN : 0) | plza_value.full);
		}
	}
[Earlier e-mail about clearing the high half of MSR_IA32_PQR_PLZA_ASSOC
was wrong. My debug trace printed the wrong value. The argument to the
wrmsrl() is correct].
-Tony
* Re: [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group
2026-01-29 18:54 ` Luck, Tony
@ 2026-01-29 19:31 ` Babu Moger
0 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-29 19:31 UTC (permalink / raw)
To: Luck, Tony, Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
Hi Tony,
On 1/29/26 12:54, Luck, Tony wrote:
> On Wed, Jan 28, 2026 at 02:03:31PM -0800, Luck, Tony wrote:
>> On Wed, Jan 21, 2026 at 03:12:54PM -0600, Babu Moger wrote:
>>> Introduce rdtgroup_plza_write(), which enables per-group control of
>>> PLZA through the resctrl filesystem, and ensure that enabling or
>>> disabling PLZA is propagated consistently across all CPUs belonging
>>> to the group.
>>>
>>> Enforce the capability checks, exclude default, pseudo-locked and CTRL_MON
>>> groups with sub monitors. Also, ensure that only one group can have PLZA
>>> enabled at a time.
>>>
>> ...
>>
>>> +static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
>>> + size_t nbytes, loff_t off)
>>> +{
>>> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
>>> + struct rdtgroup *rdtgrp, *prgrp;
>>> + int cpu, ret = 0;
>>> + bool enable;
>> ...
>>
>>> + /* Enable or disable PLZA state and update per CPU state if there is a change */
>>> + if (enable != rdtgrp->plza) {
>>> + resctrl_arch_plza_setup(r, rdtgrp->closid, rdtgrp->mon.rmid);
>> What is this for? If I've just created a group with no tasks, and empty
>> CPU mask ... it seems that this writes the MSR_IA32_PQR_PLZA_ASSOC on
>> every CPU in every domain.
Here is the reason.
Some fields of PQR_PLZA_ASSOC must be set to the same value for all HW
threads in the QOS domain for consistent operation (Per-QosDomain).
The user should use the following sequence to set these values to a
consistent state.
1. Set PQR_PLZA_ASSOC[PLZA_EN]=0 for all HW threads in the QOS Domain.

2. Set the COS_EN, COS, RMID_EN, and RMID fields of PQR_PLZA_ASSOC to
   the desired configuration on all HW threads in the QOS Domain.

3. Set PQR_PLZA_ASSOC[PLZA_EN]=1 for all HW threads in the QOS Domain
   where PLZA should be enabled.

* The user should perform this as a read-modify-write to avoid
  changing the value of the COS_EN, COS, RMID_EN, and RMID fields of
  PQR_PLZA_ASSOC.
Basically, we have to set all the fields to a consistent state to set up
PLZA first, then set the PLZA_EN bit on each thread based on its current
association.
> I think I see now. There are THREE enable bits in your
> MSR_IA32_PQR_PLZA_ASSOC.
> One each for CLOSID and RMID, and an overall PLZA_EN in the high bit.
>
> At this step you setup the CLOSID/RMID with their enable bits, but
> leaving the PLZA_EN off.
>
> Is this a subtle optimization for the context switch? Is the WRMSR
> faster if it only toggles PLZA_EN, leaving all the other bits unchanged?
I really did not think of optimization here. Mostly followed the spec.
>
> This might not be working as expected. The context switch code does:
>
> 	wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
> 	      RMID_EN | state->plza_rmid,
> 	      (plza ? PLZA_EN : 0) | CLOSID_EN | state->plza_closid);
>
> This doesn't just clear the PLZA_EN bit, it zeroes the high dword of the MSR.
>
>> It also appears that marking a task as PLZA is permanent. Moving it to
>> another group doesn't unmark it. Is this intentional?
> Ditto assigning a CPU to the PLZA group. Once done it can't be undone
> (except by turning off PLZA?).
>
> -Tony
>
> [More comments about this coming against patch 16]
>
* Re: [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group
2026-01-28 22:03 ` Luck, Tony
2026-01-29 18:54 ` Luck, Tony
@ 2026-01-29 19:42 ` Babu Moger
1 sibling, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-29 19:42 UTC (permalink / raw)
To: Luck, Tony, Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
Hi Tony,
On 1/28/26 16:03, Luck, Tony wrote:
> On Wed, Jan 21, 2026 at 03:12:54PM -0600, Babu Moger wrote:
>> Introduce rdtgroup_plza_write(), which enables per-group control of
>> PLZA through the resctrl filesystem, and ensure that enabling or
>> disabling PLZA is propagated consistently across all CPUs belonging
>> to the group.
>>
>> Enforce the capability checks, exclude default, pseudo-locked and CTRL_MON
>> groups with sub monitors. Also, ensure that only one group can have PLZA
>> enabled at a time.
>>
> ...
>
>> +static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
>> + size_t nbytes, loff_t off)
>> +{
>> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
>> + struct rdtgroup *rdtgrp, *prgrp;
>> + int cpu, ret = 0;
>> + bool enable;
> ...
>
>> + /* Enable or disable PLZA state and update per CPU state if there is a change */
>> + if (enable != rdtgrp->plza) {
>> + resctrl_arch_plza_setup(r, rdtgrp->closid, rdtgrp->mon.rmid);
> What is this for? If I've just created a group with no tasks, and empty
> CPU mask ... it seems that this writes the MSR_IA32_PQR_PLZA_ASSOC on
> every CPU in every domain.
>
>> + for_each_cpu(cpu, &rdtgrp->cpu_mask)
>> + resctrl_arch_set_cpu_plza(cpu, rdtgrp->closid,
>> + rdtgrp->mon.rmid, enable);
>> + rdtgrp->plza = enable;
>> + }
>> +
>> +unlock:
>> + rdtgroup_kn_unlock(of->kn);
>> +
>> + return ret ?: nbytes;
>> +}
> It also appears that marking a task as PLZA is permanent. Moving it to
> another group doesn't unmark it. Is this intentional?
>
> # mkdir group1 group2 plza_group
> # echo 1 > plza_group/plza
> # echo $$ > group1/tasks
> # echo $$ > plza_group/tasks
>
> My shell is now in group1 and in the plza_group
> # grep $$ */tasks
> group1/tasks:4125
> plza_group/tasks:4125
>
> Move shell to group2
> # echo $$ > group2/tasks
> # grep $$ */tasks
> group2/tasks:4125
> plza_group/tasks:4125
>
> Success in moving to group2, but still in plza_group
You are moving the task from group1 to group2. This only changes the
association in the MSR_IA32_PQR_ASSOC register; it does not change the
PLZA association. To change it:

a. Either remove the task from the plza group, which triggers a task
   update (tsk->plza = 0):
   echo >> /sys/fs/resctrl/plza_group/tasks

b. Or turn the plza group back into a regular group:
   echo 0 > /sys/fs/resctrl/plza_group/plza

Thanks for trying it out.
- Babu
* Re: [RFC PATCH 14/19] x86,fs/resctrl: Add the functionality to configure PLZA
2026-01-29 19:13 ` Luck, Tony
@ 2026-01-29 19:53 ` Babu Moger
0 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-01-29 19:53 UTC (permalink / raw)
To: Luck, Tony, Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
Hi Tony,
On 1/29/26 13:13, Luck, Tony wrote:
> On Wed, Jan 21, 2026 at 03:12:52PM -0600, Babu Moger wrote:
>> Privilege Level Zero Association (PLZA) is configured by writing to
>> MSR_IA32_PQR_PLZA_ASSOC. PLZA is disabled by default on all logical
>> processors in the QOS Domain. System software must use the following
>> sequence.
>>
>> 1. Set the closid, closid_en, rmid and rmid_en fields of
>> MSR_IA32_PQR_PLZA_ASSOC to the desired configuration on all logical
>> processors in the QOS Domain.
>>
>> 2. Set MSR_IA32_PQR_PLZA_ASSOC[PLZA_EN]=1 for
>> all logical processors in the QOS domain where PLZA should be enabled.
>>
>> MSR_IA32_PQR_PLZA_ASSOC[PLZA_EN] may have a different value on every
>> logical processor in the QOS domain. The system software should perform
>> this as a read-modify-write to avoid changing the value of closid_en,
>> closid, rmid_en, and rmid fields of MSR_IA32_PQR_PLZA_ASSOC.
> Architecturally this is true. But in the implementation for resctrl
> there is only one PLZA group. So the CLOSID and RMID fields are
> identical on every logical processor. The only changing bit is the
> PLZA_EN.
Correct.
>
> The code could be simpler if you just maintained a single global
> with the CLOSID/RMID bits initialized by resctrl_arch_plza_setup().
>
> union qos_pqr_plza_assoc plza_value; // needs a better name
>
Yea. That is a good point. We don't have to store CLOSID/RMID in
per-CPU state. Will do those changes in my next revision.
> Change the PLZA_EN define to be
>
> #define PLZA_EN BIT_ULL(63)
>
> and then the hook into the __resctrl_sched_in() becomes:
>
>
> 	if (static_branch_likely(&rdt_plza_enable_key)) {
> 		u32 plza = READ_ONCE(state->default_plza); // note, moved this inside the static branch
>
> 		tmp = READ_ONCE(tsk->plza);
> 		if (tmp)
> 			plza = tmp;
>
> 		if (plza != state->cur_plza) {
> 			state->cur_plza = plza;
> 			wrmsrq(MSR_IA32_PQR_PLZA_ASSOC,
> 			       (plza ? PLZA_EN : 0) | plza_value.full);
> 		}
> 	}
>
> [Earlier e-mail about clearing the high half of MSR_IA32_PQR_PLZA_ASSOC
> was wrong. My debug trace printed the wrong value. The argument to the
> wrmsrl() is correct].
Got it. Thanks
Babu
* Re: [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation
2026-01-21 21:12 ` [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation Babu Moger
@ 2026-02-03 0:00 ` Luck, Tony
2026-02-03 16:38 ` Babu Moger
0 siblings, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-02-03 0:00 UTC (permalink / raw)
To: Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
On Wed, Jan 21, 2026 at 03:12:42PM -0600, Babu Moger wrote:
> +Global Memory bandwidth Allocation
> +-----------------------------------
> +
> +AMD hardware supports Global Memory Bandwidth Allocation (GMBA), which provides
> +a mechanism for software to specify bandwidth limits for groups of threads
> +that span across multiple QoS domains. This collection of QOS domains is
> +referred to as GMBA control domain. The GMBA control domain is created by
> +setting the same GMBA limits in one or more QoS domains. Setting the default
> +max_bandwidth excludes the QoS domain from being part of GMBA control domain.
I don't see any checks that the user sets the *SAME* GMBA limits.
What happens if the user ignores the documentation and sets different
limits?
... snip ...
+ # cat schemata
+ GMB:0=2048;1=2048;2=2048;3=2048
+ MB:0=4096;1=4096;2=4096;3=4096
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "GMB:0=8;2=8" > schemata
+ # cat schemata
+ GMB:0= 8;1=2048;2= 8;3=2048
+ MB:0=4096;1=4096;2=4096;3=4096
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
Can the user go on to set:
# echo "GMB:1=10;3=10" > schemata
and have domains 0 & 2 with a combined 8GB limit,
while domains 1 & 3 run with a combined 10GB limit?
Or is there a single "GMBA domain"?
Will using "2048" as the "this domain isn't limited
by GMBA" value come back to haunt you when some
system has much more than 2TB bandwidth to divide up?
Should resctrl have a non-numeric "unlimited" value
in the schemata file for this?
The "mba_MBps" feature used U32_MAX as the unlimited
value. But it looks somewhat ugly in the schemata
file:
$ cat schemata
MB:0=4294967295;1=4294967295
L3:0=fff;1=fff
so I'm not sure it is a great precedent.
-Tony
* Re: [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation
2026-02-03 0:00 ` Luck, Tony
@ 2026-02-03 16:38 ` Babu Moger
2026-02-09 16:32 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-02-03 16:38 UTC (permalink / raw)
To: Luck, Tony, Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
Hi Tony,
On 2/2/26 18:00, Luck, Tony wrote:
> On Wed, Jan 21, 2026 at 03:12:42PM -0600, Babu Moger wrote:
>> +Global Memory bandwidth Allocation
>> +-----------------------------------
>> +
>> +AMD hardware supports Global Memory Bandwidth Allocation (GMBA), which provides
>> +a mechanism for software to specify bandwidth limits for groups of threads
>> +that span across multiple QoS domains. This collection of QOS domains is
>> +referred to as GMBA control domain. The GMBA control domain is created by
>> +setting the same GMBA limits in one or more QoS domains. Setting the default
>> +max_bandwidth excludes the QoS domain from being part of GMBA control domain.
> I don't see any checks that the user sets the *SAME* GMBA limits.
>
> What happens if the user ignores the dosumentation and sets different
> limits?
Good point. Adding checks could be challenging when users update each
schema individually with different values. We don't know which value
the user intends to keep.
> ... snip ...
>
> + # cat schemata
> + GMB:0=2048;1=2048;2=2048;3=2048
> + MB:0=4096;1=4096;2=4096;3=4096
> + L3:0=ffff;1=ffff;2=ffff;3=ffff
> +
> + # echo "GMB:0=8;2=8" > schemata
> + # cat schemata
> + GMB:0= 8;1=2048;2= 8;3=2048
> + MB:0=4096;1=4096;2=4096;3=4096
> + L3:0=ffff;1=ffff;2=ffff;3=ffff
>
> Can the user go on to set:
>
> # echo "GMB:1=10;3=10" > schemata
>
> and have domains 0 & 2 with a combined 8GB limit,
> while domains 1 & 3 run with a combined 10GB limit?
> Or is there a single "GMBA domain"?
In that case, it is still treated as a single GMBA domain, but the
behavior becomes unpredictable. The hardware expert mentioned that it
will default to the lowest value among all inputs, in this case 8GB.
> Will using "2048" as the "this domain isn't limited
> by GMBA" value come back to haunt you when some
> system has much more than 2TB bandwidth to divide up?
It is actually 4096 (4TB). I made a mistake in the example. I am
assuming it may not be an issue in the current generation.
It is expected to go up in the next generation.
GMB:0=4096;1=4096;2=4096;3=4096;
MB:0=8192;1=8192;2=8192;3=8192;
L3:0=ffff;1=ffff;2=ffff;3=ffff
>
> Should resctrl have a non-numeric "unlimited" value
> in the schemata file for this?
The value 4096 corresponds to the 12th bit being set. It is called the
U-bit. If the U-bit is set, that domain is not part of the GMBA domain.
I was thinking of displaying "U" in those cases. It may be a good
idea to do something like this.
GMB:0= 8;1= U;2= 8;3= U;
MB:0=8192;1=8192;2=8192;3=8192;
L3:0=ffff;1=ffff;2=ffff;3=ffff
>
> The "mba_MBps" feature used U32_MAX as the unlimited
> value. But it looks somewhat ugly in the schemata
> file:
Yes, I agree. Non-numeric would have been better.
Thanks
Babu
* Re: [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
` (18 preceding siblings ...)
2026-01-21 21:12 ` [RFC PATCH 19/19] fs/resctrl: Add per-task PLZA enable support via rdtgroup Babu Moger
@ 2026-02-03 19:58 ` Luck, Tony
2026-02-10 0:27 ` Reinette Chatre
19 siblings, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-02-03 19:58 UTC (permalink / raw)
To: Babu Moger, Drew Fustini, James Morse, Dave Martin
Cc: corbet, reinette.chatre, tglx, mingo, bp, dave.hansen, x86, hpa,
peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta, pmladek,
feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas, seanjc, xin,
manali.shukla, dapeng1.mi, chang.seok.bae, mario.limonciello,
naveen, elena.reshetova, thomas.lendacky, linux-doc, linux-kernel,
kvm, peternewman, eranian, gautham.shenoy
On Wed, Jan 21, 2026 at 03:12:38PM -0600, Babu Moger wrote:
> Privilege Level Zero Association (PLZA)
>
> Privilege Level Zero Association (PLZA) allows the hardware to
> automatically associate execution in Privilege Level Zero (CPL=0) with a
> specific COS (Class of Service) and/or RMID (Resource Monitoring
> Identifier). The QoS feature set already has a mechanism to associate
> execution on each logical processor with an RMID or COS. PLZA allows the
> system to override this per-thread association for a thread that is
> executing with CPL=0.
Adding Drew, and prodding Dave & James, for this discussion.
At LPC it was stated that both ARM and RISC-V already have support
to run kernel code with different quality of service parameters from
user code.
I'm thinking that Babu's implementation for resctrl may be
over-engineered. Specifically the part that allows users to put some
tasks into the PLZA group, while leaving others in a mode where
kernel code runs with the same QoS parameters as user code.
That comes at a cost of complexity, and of performance in the
context-switch code.
But maybe I'm missing some practical case where users want that
behaviour.
-Tony
* Re: [RFC PATCH 03/19] fs/resctrl: Add new interface max_bandwidth
2026-01-21 21:12 ` [RFC PATCH 03/19] fs/resctrl: Add new interface max_bandwidth Babu Moger
@ 2026-02-06 23:58 ` Reinette Chatre
2026-02-09 23:52 ` Moger, Babu
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-06 23:58 UTC (permalink / raw)
To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx,
mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 1/21/26 1:12 PM, Babu Moger wrote:
> While min_bandwidth is exposed for each resource under
> /sys/fs/resctrl, the maximum supported bandwidth is not currently shown.
>
> Add max_bandwidth to report the maximum bandwidth permitted for a resource.
> This helps users understand the limits of the associated resource control
> group.
>
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
With resctrl fs being used by several architectures we should take care that
interface changes take all planned usages into account.
As shared at LPC [1] and in email [2], we are already trying to create an
interface that works for everybody, and it already contains a way to expose
the maximum bandwidth to user space. You attended that LPC session, and [2],
which was directed to you, received no response. This submission with a
different interface is unexpected.
Reinette
[1] https://lpc.events/event/19/contributions/2093/attachments/1958/4172/resctrl%20Microconference%20LPC%202025%20Tokyo.pdf
[2] https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@intel.com/
* Re: [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation
2026-02-03 16:38 ` Babu Moger
@ 2026-02-09 16:32 ` Reinette Chatre
2026-02-10 19:44 ` Babu Moger
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-09 16:32 UTC (permalink / raw)
To: Babu Moger, Luck, Tony, Babu Moger
Cc: corbet, Dave.Martin, james.morse, tglx, mingo, bp, dave.hansen,
x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu and Tony,
On 2/3/26 8:38 AM, Babu Moger wrote:
> Hi Tony,
>
> On 2/2/26 18:00, Luck, Tony wrote:
>> On Wed, Jan 21, 2026 at 03:12:42PM -0600, Babu Moger wrote:
>>> +Global Memory bandwidth Allocation
>>> +-----------------------------------
>>> +
>>> +AMD hardware supports Global Memory Bandwidth Allocation (GMBA), which provides
>>> +a mechanism for software to specify bandwidth limits for groups of threads
>>> +that span across multiple QoS domains. This collection of QOS domains is
>>> +referred to as GMBA control domain. The GMBA control domain is created by
>>> +setting the same GMBA limits in one or more QoS domains. Setting the default
>>> +max_bandwidth excludes the QoS domain from being part of GMBA control domain.
>> I don't see any checks that the user sets the *SAME* GMBA limits.
>>
>> What happens if the user ignores the documentation and sets different
>> limits?
>
> Good point. Adding checks could be challenging when users update each schema individually with different values. We don't know which value the user intends to keep.
>
>> ... snip ...
>>
>> + # cat schemata
>> + GMB:0=2048;1=2048;2=2048;3=2048
>> + MB:0=4096;1=4096;2=4096;3=4096
>> + L3:0=ffff;1=ffff;2=ffff;3=ffff
>> +
>> + # echo "GMB:0=8;2=8" > schemata
>> + # cat schemata
>> + GMB:0= 8;1=2048;2= 8;3=2048
>> + MB:0=4096;1=4096;2=4096;3=4096
>> + L3:0=ffff;1=ffff;2=ffff;3=ffff
>>
>> Can the user go on to set:
>>
>> # echo "GMB:1=10;3=10" > schemata
>>
>> and have domains 0 & 2 with a combined 8GB limit,
>> while domains 1 & 3 run with a combined 10GB limit?
>> Or is there a single "GMBA domain"?
>
> In that case, it is still treated as a single GMBA domain, but the behavior becomes unpredictable. The hardware expert mentioned that it will default to the lowest value among all inputs, in this case 8GB.
>
>
>> Will using "2048" as the "this domain isn't limited
>> by GMBA" value come back to haunt you when some
>> system has much more than 2TB bandwidth to divide up?
>
> It is actually 4096 (4TB). I made a mistake in the example. I am assuming it may not be an issue in the current generation.
>
> It is expected to go up in the next generation.
>
> GMB:0=4096;1=4096;2=4096;3=4096;
> MB:0=8192;1=8192;2=8192;3=8192;
> L3:0=ffff;1=ffff;2=ffff;3=ffff
>
>
>>
>> Should resctrl have a non-numeric "unlimited" value
>> in the schemata file for this?
>
> The value 4096 corresponds to the 12th bit being set. It is called the U-bit. If the U-bit is set, that domain is not part of the GMBA domain.
>
> I was thinking of displaying "U" in those cases. It may be a good idea to do something like this.
>
> GMB:0= 8;1= U;2= 8;3= U;
> MB:0=8192;1=8192;2=8192;3=8192;
> L3:0=ffff;1=ffff;2=ffff;3=ffff
>
>
>>
>> The "mba_MBps" feature used U32_MAX as the unlimited
>> value. But it looks somewhat ugly in the schemata
>> file:
> Yes, I agree. Non-numeric would have been better.
How would such a value be described in a generic way as part of the new schema
description format?
Since the proposed format contains a maximum, I think just using that
value may be simplest, while matching what is currently displayed for
"unlimited" MB, no?
Reinette
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-01-21 21:12 ` [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE) Babu Moger
@ 2026-02-09 18:44 ` Reinette Chatre
2026-02-11 1:07 ` Moger, Babu
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-09 18:44 UTC (permalink / raw)
To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx,
mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 1/21/26 1:12 PM, Babu Moger wrote:
> On AMD systems, the existing MBA feature allows the user to set a bandwidth
> limit for each QOS domain. However, multiple QOS domains share system
> memory bandwidth as a resource. In order to ensure that system memory
> bandwidth is not over-utilized, user must statically partition the
> available system bandwidth between the active QOS domains. This typically
How do you define an "active" QoS Domain?
> results in system memory being under-utilized since not all QOS domains are
> using their full bandwidth allocation.
>
> AMD PQoS Global Bandwidth Enforcement (GLBE) provides a mechanism
> for software to specify bandwidth limits for groups of threads that span
> multiple QoS Domains. This collection of QOS domains is referred to as GLBE
> control domain. The GLBE ceiling sets a maximum limit on memory bandwidth
> in the GLBE control domain. Bandwidth is shared by all threads in a Class of
> Service (COS) across every QoS domain managed by the GLBE control domain.
How does this bandwidth allocation limit impact existing MBA? For example, if a
system has two domains (A and B) that user space separately sets MBA
allocations for while also placing both domains within a "GLBE control domain"
with a different allocation, do the individual MBA allocations still matter?
From the description it sounds as though there is a new "memory bandwidth
ceiling/limit" that seems to imply that MBA allocations are limited by
GMBA allocations, while the proposed user interface presents them as independent.
If there is indeed some dependency here ... while MBA and GMBA CLOSID are
enumerated separately, under which scenario will GMBA and MBA support different
CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
can be seen as a single "resource" that can be allocated differently based on
the various schemata associated with that resource. This currently has a
dependency on the various schemata supporting the same number of CLOSID which
may be something that we can reconsider?
Reinette
[1] https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@intel.com/
* Re: [RFC PATCH 03/19] fs/resctrl: Add new interface max_bandwidth
2026-02-06 23:58 ` Reinette Chatre
@ 2026-02-09 23:52 ` Moger, Babu
0 siblings, 0 replies; 114+ messages in thread
From: Moger, Babu @ 2026-02-09 23:52 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
Thanks for the comments. Will try to respond one by one.
On 2/6/2026 5:58 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/21/26 1:12 PM, Babu Moger wrote:
>> While min_bandwidth is exposed for each resource under
>> /sys/fs/resctrl, the maximum supported bandwidth is not currently shown.
>>
>> Add max_bandwidth to report the maximum bandwidth permitted for a resource.
>> This helps users understand the limits of the associated resource control
>> group.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>
> With resctrl fs being used by several architectures we should take care that
> interface changes take all planned usages into account.
>
> As shared at LPC [1] and in email [2], we are already trying to create an
> interface that works for everybody, and it already contains a way to expose
> the maximum bandwidth to user space. You attended that LPC session, and [2],
> which was directed to you, received no response. This submission with a
> different interface is unexpected.
Thanks for pointing this out. Yes. I missed that thread.
> Reinette
>
> [1] https://lpc.events/event/19/contributions/2093/attachments/1958/4172/resctrl%20Microconference%20LPC%202025%20Tokyo.pdf
> [2] https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@intel.com/
I need to look into this more closely. Our current plan is to support MB
and GMBA with L3 scope. So, with that in mind, I am not seeing a use
case in that context for now. I can remove exposing max_bandwidth until
we have a unified approach.
Thanks
Babu
* Re: [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group
2026-01-21 21:12 ` [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group Babu Moger
2026-01-28 22:03 ` Luck, Tony
@ 2026-02-10 0:05 ` Reinette Chatre
2026-02-11 23:10 ` Moger, Babu
1 sibling, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-10 0:05 UTC (permalink / raw)
To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx,
mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 1/21/26 1:12 PM, Babu Moger wrote:
> +static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off)
> +{
> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
Hardcoding PLZA configuration to the L3 resource is unexpected, especially since
PLZA's impact and configuration on MBA is mentioned a couple of times in this
series and discussions that followed. There also does not seem to be any
"per resource" PLZA capability; instead, when the system supports PLZA,
RDT_RESOURCE_L2, RDT_RESOURCE_L3, and RDT_RESOURCE_MBA are automatically (if
the resources are present) set to support it.
From what I understand PLZA enables user space to configure the CLOSID and RMID
used in CPL=0 independent of the resource. That is, when a user configures
PLZA with this interface all allocation information for all resources in
resource group's schemata applies.
Since this implementation makes "plza" a per-resource property it makes possible
scenarios where some resources support plza while others do not. From what I
can tell this is not reflected by the schemata file associated with a
"plza" resource group that continues to enable user space to change
allocations of all resources, whether they support plza or not.
Why was PLZA determined to be a per-resource property? It instead seems to
have a larger scope? The cycle introduced in patch #9 where the arch sets
a per-'resctrl fs' resource property and then forces resctrl fs to query
the arch for its own property seems unnecessary. Could this support just
be a global property that resctrl fs can query from the arch?
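As a rough sketch of the global-property alternative (modeled outside the kernel; the variable and helper names here are hypothetical, not from the series), the arch could expose a single query instead of a per-resource field:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the X86_FEATURE_PLZA CPU feature bit,
 * recorded once during arch init rather than per resource. */
static bool plza_feature_present;

/* A single global query resctrl fs could use, replacing the
 * per-resource resctrl_arch_get_plza_capable(level) in the series. */
static bool resctrl_arch_plza_capable(void)
{
	return plza_feature_present;
}
```

This would also avoid the arch/fs cycle mentioned above, since resctrl fs would query the arch directly instead of reading back a property the arch stored in the fs-owned resource structure.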
> + struct rdtgroup *rdtgrp, *prgrp;
> + int cpu, ret = 0;
> + bool enable;
> +
> + ret = kstrtobool(buf, &enable);
> + if (ret)
> + return ret;
> +
> + rdtgrp = rdtgroup_kn_lock_live(of->kn);
> + if (!rdtgrp) {
> + rdtgroup_kn_unlock(of->kn);
> + return -ENOENT;
> + }
> +
> + rdt_last_cmd_clear();
> +
> + if (!r->plza_capable) {
> + rdt_last_cmd_puts("PLZA is not supported in the system\n");
> + ret = -EINVAL;
> + goto unlock;
> + }
> +
> + if (rdtgrp == &rdtgroup_default) {
> + rdt_last_cmd_puts("Cannot set PLZA on a default group\n");
> + ret = -EINVAL;
> + goto unlock;
> + }
> +
> + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
> + rdt_last_cmd_puts("Resource group is pseudo-locked\n");
> + ret = -EINVAL;
> + goto unlock;
> + }
> +
> + if (!list_empty(&rdtgrp->mon.crdtgrp_list)) {
> + rdt_last_cmd_puts("Cannot change CTRL_MON group with sub monitor groups\n");
> + ret = -EINVAL;
> + goto unlock;
> + }
From what I can tell it is still possible to add monitor groups after a
CTRL_MON group is designated "plza".
If repurposing a CTRL_MON group to operate with different constraints we should
take care with how the user can continue to interact with existing files/directories
as a group transitions between plza and non-plza. One option could be to hide files
as needed to prevent the user from interacting with them; another option is to add
extra checks on all the paths that interact with these files and directories.
Reinette
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association
2026-02-03 19:58 ` [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Luck, Tony
@ 2026-02-10 0:27 ` Reinette Chatre
2026-02-11 0:40 ` Drew Fustini
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-10 0:27 UTC (permalink / raw)
To: Luck, Tony, Babu Moger, Drew Fustini, James Morse, Dave Martin,
Ben Horgan
Cc: corbet, tglx, mingo, bp, dave.hansen, x86, hpa, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, akpm, pawan.kumar.gupta, pmladek, feng.tang,
kees, arnd, fvdl, lirongqing, bhelgaas, seanjc, xin,
manali.shukla, dapeng1.mi, chang.seok.bae, mario.limonciello,
naveen, elena.reshetova, thomas.lendacky, linux-doc, linux-kernel,
kvm, peternewman, eranian, gautham.shenoy
Adding Ben
On 2/3/26 11:58 AM, Luck, Tony wrote:
> On Wed, Jan 21, 2026 at 03:12:38PM -0600, Babu Moger wrote:
>> Privilege Level Zero Association (PLZA)
>>
>> Privilege Level Zero Association (PLZA) allows the hardware to
>> automatically associate execution in Privilege Level Zero (CPL=0) with a
>> specific COS (Class of Service) and/or RMID (Resource Monitoring
>> Identifier). The QoS feature set already has a mechanism to associate
>> execution on each logical processor with an RMID or COS. PLZA allows the
>> system to override this per-thread association for a thread that is
>> executing with CPL=0.
>
> Adding Drew, and prodding Dave & James, for this discussion.
>
> At LPC it was stated that both ARM and RISC-V already have support
> to run kernel code with different quality of service parameters from
> user code.
>
> I'm thinking that Babu's implementation for resctrl may be over
> engineered. Specifically the part that allows users to put some
> tasks into the PLZA group, while leaving others in a mode where
> kernel code runs with same QoS parameters as user code.
>
> That comes at a cost of complexity, and performance in the context
> switch code.
>
> But maybe I'm missing some practical case where users want that
> behaviour.
>
> -Tony
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-28 17:44 ` Moger, Babu
2026-01-28 19:17 ` Luck, Tony
@ 2026-02-10 16:17 ` Reinette Chatre
2026-02-10 18:04 ` Reinette Chatre
2026-02-13 16:37 ` Moger, Babu
1 sibling, 2 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-10 16:17 UTC (permalink / raw)
To: Moger, Babu, Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Babu,
On 1/28/26 9:44 AM, Moger, Babu wrote:
>
>
> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>> Babu,
>>>
>>> I've read a bit more of the code now and I think I understand more.
>>>
>>> Some useful additions to your explanation.
>>>
>>> 1) Only one CTRL group can be marked as PLZA
>>
>> Yes. Correct.
Why limit it to one CTRL_MON group and why not support it for MON groups?
Limiting it to a single CTRL group seems restrictive in a few ways:
1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
number of use cases that can be supported. Consider, for example, an existing
"high priority" resource group and a "low priority" resource group. The user may
just want to let the tasks in the "low priority" resource group run as "high priority"
when in CPL0. This of course may depend on what resources are allocated, for example
cache may need more care, but if, for example, user is only interested in memory
bandwidth allocation this seems a reasonable use case?
2) Similar to what Tony [1] mentioned this does not enable what the hardware is
capable of in terms of number of different control groups/CLOSID that can be
assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
example, create a resource group that contains tasks of interest and create
a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
This will give user space better insight into system behavior and from what I can
tell is supported by the feature but not enabled?
>>
>>> 2) It can't be the root/default group
>>
>> This is something I added to keep the default group in an undisturbed state.
Why was this needed?
>>
>>> 3) It can't have sub monitor groups
Why not?
>>> 4) It can't be pseudo-locked
>>
>> Yes.
>>
>>>
>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>> need to change.
>>
>> Yes. That can be one use case.
>>
>>>
>>> If that is the case, maybe for the PLZA group we should allow user to
>>> do:
>>>
>>> # echo '*' > tasks
Dedicating a resource group to "PLZA" seems restrictive while also adding many
complications since this designation makes resource group behave differently and
thus the files need to get extra "treatments" to handle this "PLZA" designation.
I am wondering if it will not be simpler to introduce just one new file, for example
"tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
resource group to manage user space and kernel space allocations while also supporting
various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
use case where user space can create a new resource group with certain allocations but the
"tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
the resource group's allocations when in CPL0.
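To make the suggested semantics concrete, here is a minimal model of the per-task behavior (all names are hypothetical illustrations of the "tasks_cpl0" idea, not kernel code from the series):

```c
#include <stdbool.h>

/* Hypothetical per-task QoS state under the "tasks_cpl0" proposal:
 * writing a task ID to a group's "tasks_cpl0" file records that
 * group's CLOSID/RMID as the task's CPL0 override. */
struct task_qos {
	unsigned int closid;		/* from the task's resource group */
	unsigned int rmid;
	bool plza_enabled;		/* set via "tasks_cpl0" */
	unsigned int cpl0_closid;	/* override used while in CPL0 */
	unsigned int cpl0_rmid;
};

/* CLOSID the hardware would be asked to use for this task. */
static unsigned int active_closid(const struct task_qos *t, bool in_cpl0)
{
	if (in_cpl0 && t->plza_enabled)
		return t->cpl0_closid;
	return t->closid;
}
```

In this model a task can keep its normal group association for user space while borrowing a different group's CLOSID/RMID only while in CPL0, which covers both the mixed use case and the "dedicated PLZA group" use case.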
Reinette
[1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-10 16:17 ` Reinette Chatre
@ 2026-02-10 18:04 ` Reinette Chatre
2026-02-11 16:40 ` Ben Horgan
2026-02-13 16:37 ` Moger, Babu
1 sibling, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-10 18:04 UTC (permalink / raw)
To: Moger, Babu, Moger, Babu, Luck, Tony, Ben Horgan, Drew Fustini
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
+Ben and Drew
On 2/10/26 8:17 AM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>
>>
>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>> Babu,
>>>>
>>>> I've read a bit more of the code now and I think I understand more.
>>>>
>>>> Some useful additions to your explanation.
>>>>
>>>> 1) Only one CTRL group can be marked as PLZA
>>>
>>> Yes. Correct.
>
> Why limit it to one CTRL_MON group and why not support it for MON groups?
>
> Limiting it to a single CTRL group seems restrictive in a few ways:
> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> number of use cases that can be supported. Consider, for example, an existing
> "high priority" resource group and a "low priority" resource group. The user may
> just want to let the tasks in the "low priority" resource group run as "high priority"
> when in CPL0. This of course may depend on what resources are allocated, for example
> cache may need more care, but if, for example, user is only interested in memory
> bandwidth allocation this seems a reasonable use case?
> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> capable of in terms of number of different control groups/CLOSID that can be
> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> example, create a resource group that contains tasks of interest and create
> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> This will give user space better insight into system behavior and from what I can
> tell is supported by the feature but not enabled?
>
>>>
>>>> 2) It can't be the root/default group
>>>
> >>> This is something I added to keep the default group in an undisturbed state.
>
> Why was this needed?
>
>>>
>>>> 3) It can't have sub monitor groups
>
> Why not?
>
>>>> 4) It can't be pseudo-locked
>>>
>>> Yes.
>>>
>>>>
>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>> need to change.
>>>
>>> Yes. That can be one use case.
>>>
>>>>
>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>> do:
>>>>
>>>> # echo '*' > tasks
>
> Dedicating a resource group to "PLZA" seems restrictive while also adding many
> complications since this designation makes resource group behave differently and
> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>
> I am wondering if it will not be simpler to introduce just one new file, for example
> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> resource group to manage user space and kernel space allocations while also supporting
> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> use case where user space can create a new resource group with certain allocations but the
> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> the resource group's allocations when in CPL0.
It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
instead of "CPL0", use something like "kernel" or ... ?
I have not read anything about the RISC-V side of this yet.
Reinette
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation
2026-02-09 16:32 ` Reinette Chatre
@ 2026-02-10 19:44 ` Babu Moger
0 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-02-10 19:44 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, Luck, Tony
Cc: corbet, Dave.Martin, james.morse, tglx, mingo, bp, dave.hansen,
x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/9/26 10:32, Reinette Chatre wrote:
> Hi Babu and Tony,
>
> On 2/3/26 8:38 AM, Babu Moger wrote:
>> Hi Tony,
>>
>> On 2/2/26 18:00, Luck, Tony wrote:
>>> On Wed, Jan 21, 2026 at 03:12:42PM -0600, Babu Moger wrote:
>>>> +Global Memory bandwidth Allocation
>>>> +-----------------------------------
>>>> +
>>>> +AMD hardware support for Global Memory Bandwidth Allocation (GMBA) provides
>>>> +a mechanism for software to specify bandwidth limits for groups of threads
>>>> +that span across multiple QoS domains. This collection of QOS domains is
>>>> +referred to as GMBA control domain. The GMBA control domain is created by
>>>> +setting the same GMBA limits in one or more QoS domains. Setting the default
>>>> +max_bandwidth excludes the QoS domain from being part of GMBA control domain.
>>> I don't see any checks that the user sets the *SAME* GMBA limits.
>>>
>>> What happens if the user ignores the documentation and sets different
>>> limits?
>> Good point. Adding checks could be challenging when users update each schema individually with different values. We don't know which value is the one the user intends to keep.
>>
>>> ... snip ...
>>>
>>> + # cat schemata
>>> + GMB:0=2048;1=2048;2=2048;3=2048
>>> + MB:0=4096;1=4096;2=4096;3=4096
>>> + L3:0=ffff;1=ffff;2=ffff;3=ffff
>>> +
>>> + # echo "GMB:0=8;2=8" > schemata
>>> + # cat schemata
>>> + GMB:0= 8;1=2048;2= 8;3=2048
>>> + MB:0=4096;1=4096;2=4096;3=4096
>>> + L3:0=ffff;1=ffff;2=ffff;3=ffff
>>>
>>> Can the user go on to set:
>>>
>>> # echo "GMB:1=10;3=10" > schemata
>>>
>>> and have domains 0 & 2 with a combined 8GB limit,
>>> while domains 1 & 3 run with a combined 10GB limit?
>>> Or is there a single "GMBA domain"?
>> In that case, it is still treated as a single GMBA domain, but the behavior becomes unpredictable. The hardware expert mentioned that it will default to the lowest value among all inputs; in this case, 8GB.
>>
>>
>>> Will using "2048" as the "this domain isn't limited
>>> by GMBA" value come back to haunt you when some
>>> system has much more than 2TB bandwidth to divide up?
>> It is actually 4096 (4TB). I made a mistake in the example. I am assuming it may not be an issue in the current generation.
>>
>> It is expected to go up in next generation.
>>
>> GMB:0=4096;1=4096;2=4096;3=4096;
>> MB:0=8192;1=8192;2=8192;3=8192;
>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>
>>
>>> Should resctrl have a non-numeric "unlimited" value
>>> in the schemata file for this?
>> The value 4096 corresponds to the 12th bit being set. It is called the U-bit. If the U bit is set, then that domain is not part of the GMBA domain.
>>
>> I was thinking of displaying a "U" in those cases. It may be a good idea to do something like this.
>>
>> GMB:0= 8;1= U;2= 8 ;3= U;
>> MB:0=8192;1=8192;2=8192;3=8192;
>> L3:0=ffff;1=ffff;2=ffff;3=ffff
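The U-bit encoding described above can be sketched as follows (the macro and function names are illustrative, not from the series):

```c
#include <stdbool.h>

/* The value 4096 has only bit 12 set; per the discussion above this is
 * the U-bit, which excludes a QoS domain from the GMBA control domain. */
#define GMBA_U_BIT	(1u << 12)	/* == 4096 */

/* True if the GMB schemata value for a domain marks it "unlimited",
 * i.e. not participating in the GMBA control domain. */
static bool gmba_domain_unlimited(unsigned int gmb_val)
{
	return (gmb_val & GMBA_U_BIT) != 0;
}
```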
>>
>>
>>> The "mba_MBps" feature used U32_MAX as the unlimited
>>> value. But it looks somewhat ugly in the schemata
>>> file:
>> Yes, I agree. Non-numeric would have been better.
> How would such a value be described in a generic way as part of the new schema
> description format?
I don't think we need any special handling. We should report the actual
numeric value for the max in the new format.
> Since the proposed format contains a maximum I think just using that
> value may be simplest while matching what is currently displayed for
> "unlimited" MB, no?
>
Yeah. It should be ok to display the max value here.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association
2026-02-10 0:27 ` Reinette Chatre
@ 2026-02-11 0:40 ` Drew Fustini
0 siblings, 0 replies; 114+ messages in thread
From: Drew Fustini @ 2026-02-11 0:40 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, Babu Moger, James Morse, Dave Martin, Ben Horgan,
corbet, tglx, mingo, bp, dave.hansen, x86, hpa, peterz,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, akpm, pawan.kumar.gupta, pmladek, feng.tang,
kees, arnd, fvdl, lirongqing, bhelgaas, seanjc, xin,
manali.shukla, dapeng1.mi, chang.seok.bae, mario.limonciello,
naveen, elena.reshetova, thomas.lendacky, linux-doc, linux-kernel,
kvm, peternewman, eranian, gautham.shenoy
On Mon, Feb 09, 2026 at 04:27:47PM -0800, Reinette Chatre wrote:
> Adding Ben
>
> On 2/3/26 11:58 AM, Luck, Tony wrote:
> > On Wed, Jan 21, 2026 at 03:12:38PM -0600, Babu Moger wrote:
> >> Privilege Level Zero Association (PLZA)
> >>
> >> Privilege Level Zero Association (PLZA) allows the hardware to
> >> automatically associate execution in Privilege Level Zero (CPL=0) with a
> >> specific COS (Class of Service) and/or RMID (Resource Monitoring
> >> Identifier). The QoS feature set already has a mechanism to associate
> >> execution on each logical processor with an RMID or COS. PLZA allows the
> >> system to override this per-thread association for a thread that is
> >> executing with CPL=0.
> >
> > Adding Drew, and prodding Dave & James, for this discussion.
> >
> > At LPC it was stated that both ARM and RISC-V already have support
> > to run kernel code with different quality of service parameters from
> > user code.
Sorry, for RISC-V, I should clarify that there is no hardware feature
that changes the QoS identifier value when switching between kernel mode
and user mode. This could be done in the kernel task switching code, but
there is no implicit hardware operation.
Thanks,
Drew
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-09 18:44 ` Reinette Chatre
@ 2026-02-11 1:07 ` Moger, Babu
2026-02-11 16:54 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Moger, Babu @ 2026-02-11 1:07 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/9/2026 12:44 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/21/26 1:12 PM, Babu Moger wrote:
>> On AMD systems, the existing MBA feature allows the user to set a bandwidth
>> limit for each QOS domain. However, multiple QOS domains share system
>> memory bandwidth as a resource. In order to ensure that system memory
>> bandwidth is not over-utilized, user must statically partition the
>> available system bandwidth between the active QOS domains. This typically
>
> How do you define "active" QoS Domain?
Some domains may not have any CPUs associated with that CLOSID. By
"active", I'm referring to domains that have CPUs assigned to the CLOSID.
>
>> results in system memory being under-utilized since not all QOS domains are
>> using their full bandwidth Allocation.
>>
>> AMD PQoS Global Bandwidth Enforcement(GLBE) provides a mechanism
>> for software to specify bandwidth limits for groups of threads that span
>> multiple QoS Domains. This collection of QOS domains is referred to as GLBE
>> control domain. The GLBE ceiling sets a maximum limit on a memory bandwidth
>> in GLBE control domain. Bandwidth is shared by all threads in a Class of
>> Service(COS) across every QoS domain managed by the GLBE control domain.
>
> How does this bandwidth allocation limit impact existing MBA? For example, if a
> system has two domains (A and B) that user space separately sets MBA
> allocations for while also placing both domains within a "GLBE control domain"
> with a different allocation, does the individual MBA allocations still matter?
Yes. Both ceilings are enforced at their respective levels.
The MBA ceiling is applied at the QoS domain level.
The GLBE ceiling is applied at the GLBE control domain level.
If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit
will be capped by the GLBE ceiling.
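In other words, the behavior described above can be modeled as the effective limit for a COS in a given QoS domain being the smaller of the two ceilings (this is a sketch of the described semantics, not code from the series; units are illustrative):

```c
/* Model of how the per-domain MBA ceiling and the GLBE control-domain
 * ceiling compose: both are enforced at their own level, so the
 * effective per-domain limit is capped by whichever is lower. */
static unsigned int effective_mba_limit(unsigned int mba_ceiling,
					unsigned int glbe_ceiling)
{
	return mba_ceiling < glbe_ceiling ? mba_ceiling : glbe_ceiling;
}
```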
> From the description it sounds as though there is a new "memory bandwidth
> ceiling/limit" that seems to imply that MBA allocations are limited by
> GMBA allocations while the proposed user interface present them as independent.
>
> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
> enumerated separately, under which scenario will GMBA and MBA support different
> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
I can see the following scenarios where MBA and GMBA can operate
independently:
1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an
independent CLOS.
2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an
independent CLOS.
I hope this answers your question.
> can be seen as a single "resource" that can be allocated differently based on
> the various schemata associated with that resource. This currently has a
> dependency on the various schemata supporting the same number of CLOSID which
> may be something that we can reconsider?
After reviewing the new proposal again, I’m still unsure how all the
pieces will fit together. MBA and GMBA share the same scope and have
inter-dependencies. Without the full implementation details, it’s
difficult for me to provide meaningful feedback on the new approach.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure
2026-01-21 21:12 ` [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure Babu Moger
@ 2026-02-11 15:19 ` Ben Horgan
2026-02-11 16:54 ` Reinette Chatre
2026-02-13 15:50 ` Moger, Babu
0 siblings, 2 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-11 15:19 UTC (permalink / raw)
To: Babu Moger, corbet, tony.luck, reinette.chatre, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 1/21/26 21:12, Babu Moger wrote:
> Add plza_capable field to the rdt_resource structure to indicate whether
> Privilege Level Zero Association (PLZA) is supported for that resource
> type.
>
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++++
> include/linux/resctrl.h | 3 +++
> 3 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 2de3140dd6d1..e41fe5fa3f30 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -295,6 +295,9 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
>
> r->alloc_capable = true;
>
> + if (rdt_cpu_has(X86_FEATURE_PLZA))
> + r->plza_capable = true;
> +
> return true;
> }
>
> @@ -314,6 +317,9 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
> if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
> r->alloc_capable = true;
> +
> + if (rdt_cpu_has(X86_FEATURE_PLZA))
> + r->plza_capable = true;
> }
>
> static void rdt_get_cdp_config(int level)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 885026468440..540e1e719d7f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -229,6 +229,11 @@ bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
> return rdt_resources_all[l].cdp_enabled;
> }
>
> +bool resctrl_arch_get_plza_capable(enum resctrl_res_level l)
> +{
> + return rdt_resources_all[l].r_resctrl.plza_capable;
> +}
> +
> void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 63d74c0dbb8f..ae252a0e6d92 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -319,6 +319,7 @@ struct resctrl_mon {
> * @name: Name to use in "schemata" file.
> * @schema_fmt: Which format string and parser is used for this schema.
> * @cdp_capable: Is the CDP feature available on this resource
> + * @plza_capable: Is Privilege Level Zero Association capable?
> */
> struct rdt_resource {
> int rid;
> @@ -334,6 +335,7 @@ struct rdt_resource {
> char *name;
> enum resctrl_schema_fmt schema_fmt;
> bool cdp_capable;
> + bool plza_capable;
Why are you making plza a resource property? Certainly for MPAM we'd
want this to be global across resources, and I see above that you are
just checking a CPU property rather than anything per resource.
> };
>
> /*
> @@ -481,6 +483,7 @@ static inline u32 resctrl_get_config_index(u32 closid,
>
> bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l);
> int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
> +bool resctrl_arch_get_plza_capable(enum resctrl_res_level l);
>
> /**
> * resctrl_arch_mbm_cntr_assign_enabled() - Check if MBM counter assignment
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-10 18:04 ` Reinette Chatre
@ 2026-02-11 16:40 ` Ben Horgan
2026-02-11 19:46 ` Luck, Tony
2026-02-11 22:22 ` Reinette Chatre
0 siblings, 2 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-11 16:40 UTC (permalink / raw)
To: Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi,
Thanks for including me.
On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
> +Ben and Drew
>
> On 2/10/26 8:17 AM, Reinette Chatre wrote:
> > Hi Babu,
> >
> > On 1/28/26 9:44 AM, Moger, Babu wrote:
> >>
> >>
> >> On 1/28/2026 11:41 AM, Moger, Babu wrote:
> >>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> >>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
> >>>> Babu,
> >>>>
> >>>> I've read a bit more of the code now and I think I understand more.
> >>>>
> >>>> Some useful additions to your explanation.
> >>>>
> >>>> 1) Only one CTRL group can be marked as PLZA
> >>>
> >>> Yes. Correct.
> >
> > Why limit it to one CTRL_MON group and why not support it for MON groups?
> >
> > Limiting it to a single CTRL group seems restrictive in a few ways:
> > 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> > number of use cases that can be supported. Consider, for example, an existing
> > "high priority" resource group and a "low priority" resource group. The user may
> > just want to let the tasks in the "low priority" resource group run as "high priority"
> > when in CPL0. This of course may depend on what resources are allocated, for example
> > cache may need more care, but if, for example, user is only interested in memory
> > bandwidth allocation this seems a reasonable use case?
> > 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> > capable of in terms of number of different control groups/CLOSID that can be
> > assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> > 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> > MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> > example, create a resource group that contains tasks of interest and create
> > a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> > This will give user space better insight into system behavior and from what I can
> > tell is supported by the feature but not enabled?
> >
> >>>
> >>>> 2) It can't be the root/default group
> >>>
> >>> This is something I added to keep the default group in an undisturbed,
> >
> > Why was this needed?
> >
> >>>
> >>>> 3) It can't have sub monitor groups
> >
> > Why not?
> >
> >>>> 4) It can't be pseudo-locked
> >>>
> >>> Yes.
> >>>
> >>>>
> >>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
> >>>> would avoid any additional context switch overhead as the PLZA MSR would never
> >>>> need to change.
> >>>
> >>> Yes. That can be one use case.
> >>>
> >>>>
> >>>> If that is the case, maybe for the PLZA group we should allow user to
> >>>> do:
> >>>>
> >>>> # echo '*' > tasks
> >
> > Dedicating a resource group to "PLZA" seems restrictive while also adding many
> > complications since this designation makes resource group behave differently and
> > thus the files need to get extra "treatments" to handle this "PLZA" designation.
> >
> > I am wondering if it will not be simpler to introduce just one new file, for example
> > "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> > file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> > task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> > resource group to manage user space and kernel space allocations while also supporting
> > various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> > use case where user space can create a new resource group with certain allocations but the
> > "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> > the resource group's allocations when in CPL0.
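To make the semantics of the quoted proposal concrete, here is a rough model in Python. All of it is hypothetical: "tasks_cpl0" is only a suggested interface, and the group layout below is invented for illustration.

```python
# Sketch of the "tasks_cpl0" proposal: a task may appear in a group's
# "tasks" file (association used in user space) and/or a "tasks_cpl0"
# file (association used while executing at CPL0). All names invented.

def effective_ids(task, groups, cpl):
    """Return the (closid, rmid) the hardware would use for this task.

    groups: dict mapping group name -> {"closid", "rmid", "tasks", "tasks_cpl0"}
    cpl: current privilege level (0 == kernel).
    """
    # Pick the membership file that applies at this privilege level.
    key = "tasks_cpl0" if cpl == 0 else "tasks"
    for grp in groups.values():
        if task in grp[key]:
            return grp["closid"], grp["rmid"]
    # Fall back to the default group if the task is in neither file.
    default = groups["/"]
    return default["closid"], default["rmid"]


groups = {
    "/":    {"closid": 0, "rmid": 0, "tasks": set(), "tasks_cpl0": set()},
    "low":  {"closid": 1, "rmid": 1, "tasks": {1234}, "tasks_cpl0": set()},
    "high": {"closid": 2, "rmid": 2, "tasks": set(), "tasks_cpl0": {1234}},
}

# Task 1234 runs as "low" in user space but as "high" when in the kernel.
print(effective_ids(1234, groups, cpl=3))  # (1, 1)
print(effective_ids(1234, groups, cpl=0))  # (2, 2)
```

This mirrors the "low priority task runs as high priority in CPL0" use case: the same task picks up the "high" group's CLOSID/RMID only while in the kernel.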
If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
>
> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
> instead of CPL0 using something like "kernel" or ... ?
Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
internally and here are a few thoughts.
If the use case is just an option to run all tasks with the same closid/rmid
(partid/pmg) configuration when they are running in the kernel, then I'd favour a
mount option. The resctrl filesystem interface doesn't need to change and
userspace software doesn't need to change. This could either take away a
closid/rmid from userspace and dedicate it to the kernel, or perhaps have a
policy that makes the default group the kernel group. If you use the default
configuration, at least for MPAM, the kernel may not be running at the highest
priority, as a minimum bandwidth can be used to give a priority boost. (Once we
have a resctrl schema for this.)
It could be useful to have something a bit more featureful though. Is there a
need for the two mappings, task->cpl0 config and task->cpl1 config, to be
independent, or would a single task->(cpl0 config, cpl1 config) mapping be
sufficient? It seems awkward that it's not a single write to move a task. If a
single mapping is sufficient, then add a single new file, kernel_group, per
CTRL_MON group (maybe MON groups too) as suggested above, but rather than a
task, that file could hold a path to the CTRL_MON/MON group that provides the
kernel configuration for tasks running in that group. So that this can be
transparent to existing software, an empty string can mean use the current
group's configuration when in the kernel (as well as for userspace). A slash,
/, could be used to refer to the default group. This would give something like
the below under /sys/fs/resctrl.
.
├── cpus
├── tasks
├── ctrl1
│ ├── cpus
│ ├── kernel_group -> mon_groups/mon1
│ └── tasks
├── kernel_group -> ctrl1
└── mon_groups
└── mon1
├── cpus
├── kernel_group -> ctrl1
└── tasks
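The lookup implied by the layout above could be sketched as follows (purely illustrative; the "kernel_group" file and its semantics are only a proposal at this point, and the group names come from the example tree):

```python
# Sketch of the proposed "kernel_group" file semantics: each group may
# name another group whose configuration is used while its tasks run in
# the kernel. An empty string means "use the current group" (transparent
# to existing software); "/" means the default (root) group.

def kernel_config_group(current, kernel_group_file):
    """Resolve which group's config applies when running in the kernel."""
    if kernel_group_file == "":
        return current            # unchanged behaviour for existing users
    if kernel_group_file == "/":
        return "/"                # the default group
    # Otherwise the file holds a path to a CTRL_MON/MON group,
    # e.g. "ctrl1" or (relative) "mon_groups/mon1" as in the tree above.
    return kernel_group_file


# Mirroring the example tree:
assert kernel_config_group("/", "ctrl1") == "ctrl1"
assert kernel_config_group("ctrl1", "mon_groups/mon1") == "mon_groups/mon1"
assert kernel_config_group("ctrl2", "") == "ctrl2"   # transparent default
assert kernel_config_group("ctrl1", "/") == "/"
```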
>
> I have not read anything about the RISC-V side of this yet.
>
> Reinette
>
> >
> > Reinette
> >
> > [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-11 1:07 ` Moger, Babu
@ 2026-02-11 16:54 ` Reinette Chatre
2026-02-11 21:18 ` Babu Moger
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-11 16:54 UTC (permalink / raw)
To: Moger, Babu, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 2/10/26 5:07 PM, Moger, Babu wrote:
> Hi Reinette,
>
>
> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>> On AMD systems, the existing MBA feature allows the user to set a bandwidth
>>> limit for each QOS domain. However, multiple QOS domains share system
>>> memory bandwidth as a resource. In order to ensure that system memory
>>> bandwidth is not over-utilized, user must statically partition the
>>> available system bandwidth between the active QOS domains. This typically
>>
>> How do you define "active" QoS Domain?
>
> Some domains may not have any CPUs associated with that CLOSID. By "active", I'm referring to domains that have CPUs assigned to the CLOSID.
To confirm, is this then specific to assigning CPUs to resource groups via
the cpus/cpus_list files? This refers to how a user needs to partition
available bandwidth so I am still trying to understand the message here since
users still need to do this even when CPUs are not assigned to resource
groups.
>
>>
>>> results in system memory being under-utilized since not all QOS domains are
>>> using their full bandwidth Allocation.
>>>
>>> AMD PQoS Global Bandwidth Enforcement(GLBE) provides a mechanism
>>> for software to specify bandwidth limits for groups of threads that span
>>> multiple QoS Domains. This collection of QOS domains is referred to as GLBE
>>> control domain. The GLBE ceiling sets a maximum limit on a memory bandwidth
>>> in GLBE control domain. Bandwidth is shared by all threads in a Class of
>>> Service(COS) across every QoS domain managed by the GLBE control domain.
>>
>> How does this bandwidth allocation limit impact existing MBA? For example, if a
>> system has two domains (A and B) that user space separately sets MBA
>> allocations for while also placing both domains within a "GLBE control domain"
>> with a different allocation, does the individual MBA allocations still matter?
>
> Yes. Both ceilings are enforced at their respective levels.
> The MBA ceiling is applied at the QoS domain level.
> The GLBE ceiling is applied at the GLBE control domain level.
> If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
It sounds as though MBA and GMBA/GLBE operate within the same parameters wrt
the limits but in examples in this series they have different limits. For example,
in the documentation patch [1] there is this:
# cat schemata
GMB:0=2048;1=2048;2=2048;3=2048
MB:0=4096;1=4096;2=4096;3=4096
L3:0=ffff;1=ffff;2=ffff;3=ffff
followed up with what it will look like in new generation [2]:
GMB:0=4096;1=4096;2=4096;3=4096
MB:0=8192;1=8192;2=8192;3=8192
L3:0=ffff;1=ffff;2=ffff;3=ffff
In both examples the per-domain MB ceiling is higher than the global GMB ceiling. With
the above showing defaults, and you state "If the MBA ceiling exceeds the GLBE ceiling,
the effective MBA limit will be capped by the GLBE ceiling." - does this mean that
the MB ceiling can never be higher than the GMB ceiling, as shown in the examples?
Another question, setting aside possible differences between MB and GMB:
I am trying to understand how a user may expect to interact with these interfaces ...
Consider the starting state example as below where the MB and GMB ceilings are the
same:
# cat schemata
GMB:0=2048;1=2048;2=2048;3=2048
MB:0=2048;1=2048;2=2048;3=2048
Would something like below be accurate? Specifically, showing how the GMB limit impacts the
MB limit:
# echo "GMB:0=8;2=8" > schemata
# cat schemata
GMB:0=8;1=2048;2=8;3=2048
MB:0=8;1=2048;2=8;3=2048
... and then when user space resets GMB the MB can reset like ...
# echo "GMB:0=2048;2=2048" > schemata
# cat schemata
GMB:0=2048;1=2048;2=2048;3=2048
MB:0=2048;1=2048;2=2048;3=2048
If I understand correctly this will only apply if the MB limit was never set, so
another scenario may be to keep a previous MB setting after a GMB change:
# cat schemata
GMB:0=2048;1=2048;2=2048;3=2048
MB:0=8;1=2048;2=8;3=2048
# echo "GMB:0=8;2=8" > schemata
# cat schemata
GMB:0=8;1=2048;2=8;3=2048
MB:0=8;1=2048;2=8;3=2048
# echo "GMB:0=2048;2=2048" > schemata
# cat schemata
GMB:0=2048;1=2048;2=2048;3=2048
MB:0=8;1=2048;2=8;3=2048
What would be the most intuitive way for users to interact with these interfaces?
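One way to reason about the scenarios above is to model the effective MB value as capped by GMB, while remembering whether MB was ever set explicitly. This is a speculative model of one possible policy, not behaviour the series implements:

```python
# Speculative model of the GMB/MB interaction discussed above: the
# effective MB ceiling is capped by the GMB ceiling, and an explicitly
# written MB setting survives GMB changes. Values are illustrative.

class Domain:
    def __init__(self, maximum=2048):
        self.gmb = maximum
        self.mb_requested = maximum   # what the user last wrote to MB
        self.mb_explicit = False      # was MB ever set directly?

    def write_gmb(self, value):
        self.gmb = value

    def write_mb(self, value):
        self.mb_requested = value
        self.mb_explicit = True

    def effective_mb(self):
        # MB can never exceed GMB; if MB was never set, it follows GMB.
        if not self.mb_explicit:
            return self.gmb
        return min(self.mb_requested, self.gmb)


d = Domain()
d.write_gmb(8)
print(d.effective_mb())   # 8    (MB follows GMB down)
d.write_gmb(2048)
print(d.effective_mb())   # 2048 (MB resets with GMB, as it was never set)

d2 = Domain()
d2.write_mb(8)
d2.write_gmb(2048)
print(d2.effective_mb())  # 8    (explicit MB setting is kept)
```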
>> From the description it sounds as though there is a new "memory bandwidth
>> ceiling/limit" that seems to imply that MBA allocations are limited by
>> GMBA allocations while the proposed user interface present them as independent.
>>
>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>> enumerated separately, under which scenario will GMBA and MBA support different
>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>
> I can see the following scenarios where MBA and GMBA can operate independently:
> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
> I hope this clarifies your question.
No. When enumerating the features the number of CLOSID supported by each is
enumerated separately. That means GMBA and MBA may support different number of CLOSID.
My question is: "under which scenario will GMBA and MBA support different CLOSID?"
Because of a possible difference in number of CLOSIDs it seems the feature supports possible
scenarios where some resource groups can support global AND per-domain limits while other
resource groups can just support global or just support per-domain limits. Is this correct?
>> can be seen as a single "resource" that can be allocated differently based on
>> the various schemata associated with that resource. This currently has a
>> dependency on the various schemata supporting the same number of CLOSID which
>> may be something that we can reconsider?
>
> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
The new approach is not final so please provide feedback to help improve it so
that the features you are enabling can be supported well.
Reinette
[1] https://lore.kernel.org/lkml/d58f70592a4ce89e744e7378e49d5a36be3fd05e.1769029977.git.babu.moger@amd.com/
[2] https://lore.kernel.org/lkml/e0c79c53-489d-47bf-89b9-f1bb709316c6@amd.com/
* Re: [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure
2026-02-11 15:19 ` Ben Horgan
@ 2026-02-11 16:54 ` Reinette Chatre
2026-02-11 17:48 ` Ben Horgan
2026-02-13 15:50 ` Moger, Babu
1 sibling, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-11 16:54 UTC (permalink / raw)
To: Ben Horgan, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Ben,
On 2/11/26 7:19 AM, Ben Horgan wrote:
> Hi Babu,
>
> On 1/21/26 21:12, Babu Moger wrote:
>> Add plza_capable field to the rdt_resource structure to indicate whether
>> Privilege Level Zero Association (PLZA) is supported for that resource
>> type.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++++
>> include/linux/resctrl.h | 3 +++
>> 3 files changed, 14 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 2de3140dd6d1..e41fe5fa3f30 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -295,6 +295,9 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
>>
>> r->alloc_capable = true;
>>
>> + if (rdt_cpu_has(X86_FEATURE_PLZA))
>> + r->plza_capable = true;
>> +
>> return true;
>> }
>>
>> @@ -314,6 +317,9 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>> if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
>> r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
>> r->alloc_capable = true;
>> +
>> + if (rdt_cpu_has(X86_FEATURE_PLZA))
>> + r->plza_capable = true;
>> }
>>
>> static void rdt_get_cdp_config(int level)
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 885026468440..540e1e719d7f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -229,6 +229,11 @@ bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>> return rdt_resources_all[l].cdp_enabled;
>> }
>>
>> +bool resctrl_arch_get_plza_capable(enum resctrl_res_level l)
>> +{
>> + return rdt_resources_all[l].r_resctrl.plza_capable;
>> +}
>> +
>> void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
>> {
>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index 63d74c0dbb8f..ae252a0e6d92 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -319,6 +319,7 @@ struct resctrl_mon {
>> * @name: Name to use in "schemata" file.
>> * @schema_fmt: Which format string and parser is used for this schema.
>> * @cdp_capable: Is the CDP feature available on this resource
>> + * @plza_capable: Is Privilege Level Zero Association capable?
>> */
>> struct rdt_resource {
>> int rid;
>> @@ -334,6 +335,7 @@ struct rdt_resource {
>> char *name;
>> enum resctrl_schema_fmt schema_fmt;
>> bool cdp_capable;
>> + bool plza_capable;
>
> Why are you making plza a resource property? Certainly for MPAM we'd
> want this to be global across resources and I see above that you are
>> just checking a cpu property rather than anything per resource.
I agree. For reference: https://lore.kernel.org/lkml/6fe647ce-2e65-45dd-9c79-d1c2cb0991fe@intel.com/
One possible concern for MPAM related to this caught my eye. From
https://lore.kernel.org/lkml/20260203214342.584712-10-ben.horgan@arm.com/ :
If an SMCU is not shared with other cpus then it is implementation
defined whether the configuration from MPAMSM_EL1 is used or that from
the appropriate MPAMy_ELx. As we set the same, PMG_D and PARTID_D,
configuration for MPAM0_EL1, MPAM1_EL1 and MPAMSM_EL1 the resulting
configuration is the same regardless.
I admit that I am not yet comfortable with the MPAM register usages ... but from
above it sounds to me as though if resctrl associates different CLOSID/PARTID and
RMID/PMG with a task to be used at different privilege levels as planned with this
work then the mapping to MPAM0_EL1 and MPAM1_EL1 may be easy but MPAMSM_EL1 may be
difficult?
Reinette
* Re: [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure
2026-02-11 16:54 ` Reinette Chatre
@ 2026-02-11 17:48 ` Ben Horgan
0 siblings, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-11 17:48 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu, Reinette,
On 2/11/26 16:54, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/11/26 7:19 AM, Ben Horgan wrote:
>> Hi Babu,
>>
>> On 1/21/26 21:12, Babu Moger wrote:
>>> Add plza_capable field to the rdt_resource structure to indicate whether
>>> Privilege Level Zero Association (PLZA) is supported for that resource
>>> type.
>>>
>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>> ---
>>> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
>>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++++
>>> include/linux/resctrl.h | 3 +++
>>> 3 files changed, 14 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index 2de3140dd6d1..e41fe5fa3f30 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -295,6 +295,9 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
>>>
>>> r->alloc_capable = true;
>>>
>>> + if (rdt_cpu_has(X86_FEATURE_PLZA))
>>> + r->plza_capable = true;
>>> +
>>> return true;
>>> }
>>>
>>> @@ -314,6 +317,9 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>>> if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
>>> r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
>>> r->alloc_capable = true;
>>> +
>>> + if (rdt_cpu_has(X86_FEATURE_PLZA))
>>> + r->plza_capable = true;
>>> }
>>>
>>> static void rdt_get_cdp_config(int level)
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index 885026468440..540e1e719d7f 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -229,6 +229,11 @@ bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>>> return rdt_resources_all[l].cdp_enabled;
>>> }
>>>
>>> +bool resctrl_arch_get_plza_capable(enum resctrl_res_level l)
>>> +{
>>> + return rdt_resources_all[l].r_resctrl.plza_capable;
>>> +}
>>> +
>>> void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
>>> {
>>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 63d74c0dbb8f..ae252a0e6d92 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -319,6 +319,7 @@ struct resctrl_mon {
>>> * @name: Name to use in "schemata" file.
>>> * @schema_fmt: Which format string and parser is used for this schema.
>>> * @cdp_capable: Is the CDP feature available on this resource
>>> + * @plza_capable: Is Privilege Level Zero Association capable?
>>> */
>>> struct rdt_resource {
>>> int rid;
>>> @@ -334,6 +335,7 @@ struct rdt_resource {
>>> char *name;
>>> enum resctrl_schema_fmt schema_fmt;
>>> bool cdp_capable;
>>> + bool plza_capable;
>>
>> Why are you making plza a resource property? Certainly for MPAM we'd
>> want this to be global across resources and I see above that you are
>> just checking a cpu property rather than anything per resource.
>
> I agree. For reference: https://lore.kernel.org/lkml/6fe647ce-2e65-45dd-9c79-d1c2cb0991fe@intel.com/
Ah, didn't mean to duplicate. Glad we agree.
> One possible concern for MPAM related to this caught my eye. From
> https://lore.kernel.org/lkml/20260203214342.584712-10-ben.horgan@arm.com/ :
>
> If an SMCU is not shared with other cpus then it is implementation
> defined whether the configuration from MPAMSM_EL1 is used or that from
> the appropriate MPAMy_ELx. As we set the same, PMG_D and PARTID_D,
> configuration for MPAM0_EL1, MPAM1_EL1 and MPAMSM_EL1 the resulting
> configuration is the same regardless.
>
> I admit that I am not yet comfortable with the MPAM register usages ... but from
> above it sounds to me as though if resctrl associates different CLOSID/PARTID and
> RMID/PMG with a task to be used at different privilege levels as planned with this
> work then the mapping to MPAM0_EL1 and MPAM1_EL1 may be easy but MPAMSM_EL1 may be
> difficult?
Thanks for bringing this up. The kernel has limited usage of the SMCU.
The SMCU performs matrix and SIMD instructions for the CPU. In the
kernel these are just used for save/restore of the SIMD/matrix register
state at context switch, and in the future usage could possibly be
extended in a similar way to old-style SIMD/NEON, guarded by something
like neon_begin()/neon_end(). If we wish to use kernel-specific
pmg/partids for those loads/stores we can copy the MPAM1_EL1
configuration into MPAMSM_EL1. (Then it doesn't matter if the
configuration from MPAMSM_EL1 or MPAM1_EL1 is used.) This is analogous
to how we copy MPAM1_EL1 to MPAM2_EL2 to provide a configuration for the
kvm nvhe hypervisor.
See:
https://lore.kernel.org/kvmarm/9a8a163e-887a-45fc-aae5-45e564360c8b@arm.com/T/#m23281370dbcdaca98482769de1eae496afadc3b0
>
>
> Reinette
>
>
Thanks,
Ben
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-11 16:40 ` Ben Horgan
@ 2026-02-11 19:46 ` Luck, Tony
2026-02-11 22:22 ` Reinette Chatre
1 sibling, 0 replies; 114+ messages in thread
From: Luck, Tony @ 2026-02-11 19:46 UTC (permalink / raw)
To: Ben Horgan
Cc: Reinette Chatre, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
On Wed, Feb 11, 2026 at 04:40:32PM +0000, Ben Horgan wrote:
> Hi,
>
> Thanks for including me.
>
> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
> > +Ben and Drew
> >
> > On 2/10/26 8:17 AM, Reinette Chatre wrote:
> > > Hi Babu,
> > >
> > > On 1/28/26 9:44 AM, Moger, Babu wrote:
> > >>
> > >>
> > >> On 1/28/2026 11:41 AM, Moger, Babu wrote:
> > >>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> > >>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
> > >>>> Babu,
> > >>>>
> > >>>> I've read a bit more of the code now and I think I understand more.
> > >>>>
> > >>>> Some useful additions to your explanation.
> > >>>>
> > >>>> 1) Only one CTRL group can be marked as PLZA
> > >>>
> > >>> Yes. Correct.
> > >
> > > Why limit it to one CTRL_MON group and why not support it for MON groups?
> > >
> > > Limiting it to a single CTRL group seems restrictive in a few ways:
> > > 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> > > number of use cases that can be supported. Consider, for example, an existing
> > > "high priority" resource group and a "low priority" resource group. The user may
> > > just want to let the tasks in the "low priority" resource group run as "high priority"
> > > when in CPL0. This of course may depend on what resources are allocated, for example
> > > cache may need more care, but if, for example, user is only interested in memory
> > > bandwidth allocation this seems a reasonable use case?
> > > 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> > > capable of in terms of number of different control groups/CLOSID that can be
> > > assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> > > 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> > > MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> > > example, create a resource group that contains tasks of interest and create
> > > a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> > > This will give user space better insight into system behavior and from what I can
> > > tell is supported by the feature but not enabled?
> > >
> > >>>
> > >>>> 2) It can't be the root/default group
> > >>>
> > >>> This is something I added to keep the default group in an undisturbed,
> > >
> > > Why was this needed?
> > >
> > >>>
> > >>>> 3) It can't have sub monitor groups
> > >
> > > Why not?
> > >
> > >>>> 4) It can't be pseudo-locked
> > >>>
> > >>> Yes.
> > >>>
> > >>>>
> > >>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
> > >>>> would avoid any additional context switch overhead as the PLZA MSR would never
> > >>>> need to change.
> > >>>
> > >>> Yes. That can be one use case.
> > >>>
> > >>>>
> > >>>> If that is the case, maybe for the PLZA group we should allow user to
> > >>>> do:
> > >>>>
> > >>>> # echo '*' > tasks
> > >
> > > Dedicating a resource group to "PLZA" seems restrictive while also adding many
> > > complications since this designation makes resource group behave differently and
> > > thus the files need to get extra "treatments" to handle this "PLZA" designation.
> > >
> > > I am wondering if it will not be simpler to introduce just one new file, for example
> > > "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> > > file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> > > task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> > > resource group to manage user space and kernel space allocations while also supporting
> > > various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> > > use case where user space can create a new resource group with certain allocations but the
> > > "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> > > the resource group's allocations when in CPL0.
>
> If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
>
> >
> > It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
> > with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
> > instead of CPL0 using something like "kernel" or ... ?
>
> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
> internally and here are a few thoughts.
>
> If the use case is just an option to run all tasks with the same closid/rmid
> (partid/pmg) configuration when they are running in the kernel, then I'd favour a
> mount option. The resctrl filesystem interface doesn't need to change and
> userspace software doesn't need to change. This could either take away a
> closid/rmid from userspace and dedicate it to the kernel, or perhaps have a
> policy that makes the default group the kernel group. If you use the default
> configuration, at least for MPAM, the kernel may not be running at the highest
> priority, as a minimum bandwidth can be used to give a priority boost. (Once we
> have a resctrl schema for this.)
I'm a big fan of this use case. It's easy to understand why users would
want this. It avoids the issue that syscalls, page-faults, and
interrupts from a task with very limited resources will spend ages in
the kernel. Users have complained about the priority inversions that
this can cause.
It also has a simpler implementation. No changes to the context switch
code are needed; on x86 it only requires some simple method to steal a
CLOSID and configure resources for that CLOSID.
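The CLOSID-stealing idea can be sketched roughly as below (purely illustrative allocator logic; nothing like this exists in resctrl today, and the convention of taking the highest CLOSID is an assumption for the example):

```python
# Illustrative sketch: reserve one CLOSID out of the hardware pool for
# kernel-mode execution, leaving the rest for user-space resource groups.

def partition_closids(num_closids, reserve_for_kernel=True):
    """Return (kernel_closid, user_closids) given the hardware count."""
    if not reserve_for_kernel or num_closids < 2:
        return None, list(range(num_closids))
    # Convention assumed here: steal the highest CLOSID for the kernel.
    return num_closids - 1, list(range(num_closids - 1))


kernel_closid, user_closids = partition_closids(16)
print(kernel_closid)       # 15
print(len(user_closids))   # 15
```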
>
> It could be useful to have something a bit more featureful though. Is there a
Many things have theoretical use cases. I'd like to hear from some
resctrl users whether they will make use of these extra features.
Babu's RFC allows for some tasks to be in the PLZA group while others
will run in kernel mode with the same resources that are granted to
the CTRL group they belong to.
Reinette asked[1] whether the PLZA mode should be extended to multiple
CTRL groups and their child CTRL_MON groups for even greater
flexibility.
[1] https://lore.kernel.org/all/7a4ea07d-88e6-4f0f-a3ce-4fd97388cec4@intel.com/
> need for the two mappings, task->cpl0 config and task->cpl1 config, to be
> independent, or would a single task->(cpl0 config, cpl1 config) mapping be
> sufficient? It seems awkward that it's not a single write to move a task. If a
> single mapping is sufficient, then add a single new file, kernel_group, per
> CTRL_MON group (maybe MON groups too) as suggested above, but rather than a
> task, that file could hold a path to the CTRL_MON/MON group that provides the
> kernel configuration for tasks running in that group. So that this can be
> transparent to existing software, an empty string can mean use the current
> group's configuration when in the kernel (as well as for userspace). A slash,
> /, could be used to refer to the default group. This would
> give something like the below under /sys/fs/resctrl.
>
> .
> ├── cpus
> ├── tasks
> ├── ctrl1
> │ ├── cpus
> │ ├── kernel_group -> mon_groups/mon1
> │ └── tasks
> ├── kernel_group -> ctrl1
> └── mon_groups
> └── mon1
> ├── cpus
> ├── kernel_group -> ctrl1
> └── tasks
>
> >
> > I have not read anything about the RISC-V side of this yet.
> >
> > Reinette
> >
> > >
> > > Reinette
> > >
> > > [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
> >
>
> Thanks,
>
> Ben
-Tony
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-11 16:54 ` Reinette Chatre
@ 2026-02-11 21:18 ` Babu Moger
2026-02-12 3:51 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-02-11 21:18 UTC (permalink / raw)
To: Reinette Chatre, Moger, Babu, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/11/26 10:54, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/10/26 5:07 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>>
>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>> On AMD systems, the existing MBA feature allows the user to set a bandwidth
>>>> limit for each QOS domain. However, multiple QOS domains share system
>>>> memory bandwidth as a resource. In order to ensure that system memory
>>>> bandwidth is not over-utilized, user must statically partition the
>>>> available system bandwidth between the active QOS domains. This typically
>>> How do you define "active" QoS Domain?
>> Some domains may not have any CPUs associated with that CLOSID. By "active", I'm referring to domains that have CPUs assigned to the CLOSID.
> To confirm, is this then specific to assigning CPUs to resource groups via
> the cpus/cpus_list files? This refers to how a user needs to partition
> available bandwidth so I am still trying to understand the message here since
> users still need to do this even when CPUs are not assigned to resource
> groups.
>
It is not specific to CPU assignment. It applies to task assignment also.
For example: We have 4 domains;
# cat schemata
MB:0=8192;1=8192;2=8192;3=8192
If this group has CPUs assigned to only the first two domains, then the
group has only two active domains and we will only update the first
two domains. The MB values in the other domains do not matter.
# echo "MB:0=8;1=8" > schemata
# cat schemata
MB:0=8;1=8;2=8192;3=8192
The combined bandwidth can go up to 16 (8+8) units. Each unit is 1/8 GB.
With GMBA, we can set the combined limit to a higher level and the total
bandwidth will not exceed the GMBA limit.
>>>> results in system memory being under-utilized since not all QOS domains are
>>>> using their full bandwidth allocation.
>>>>
>>>> AMD PQoS Global Bandwidth Enforcement (GLBE) provides a mechanism
>>>> for software to specify bandwidth limits for groups of threads that span
>>>> multiple QoS Domains. This collection of QoS domains is referred to as the
>>>> GLBE control domain. The GLBE ceiling sets a maximum limit on memory bandwidth
>>>> in the GLBE control domain. Bandwidth is shared by all threads in a Class of
>>>> Service (COS) across every QoS domain managed by the GLBE control domain.
>>> How does this bandwidth allocation limit impact existing MBA? For example, if a
>>> system has two domains (A and B) that user space separately sets MBA
>>> allocations for while also placing both domains within a "GLBE control domain"
>>> with a different allocation, does the individual MBA allocations still matter?
>> Yes. Both ceilings are enforced at their respective levels.
>> The MBA ceiling is applied at the QoS domain level.
>> The GLBE ceiling is applied at the GLBE control domain level.
>> If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
> It sounds as though MBA and GMBA/GLBE operates within the same parameters wrt
> the limits but in examples in this series they have different limits. For example,
> in the documentation patch [1] there is this:
>
> # cat schemata
> GMB:0=2048;1=2048;2=2048;3=2048
> MB:0=4096;1=4096;2=4096;3=4096
> L3:0=ffff;1=ffff;2=ffff;3=ffff
>
> followed up with what it will look like in new generation [2]:
>
> GMB:0=4096;1=4096;2=4096;3=4096
> MB:0=8192;1=8192;2=8192;3=8192
> L3:0=ffff;1=ffff;2=ffff;3=ffff
>
> In both examples the per-domain MB ceiling is higher than the global GMB ceiling. With
> above showing defaults and you state "If the MBA ceiling exceeds the GLBE ceiling,
> the effective MBA limit will be capped by the GLBE ceiling." - does this mean that
> MB ceiling can never be higher than GMB ceiling as shown in the examples?
That is correct. There is one more piece of information here: the MB unit
is 1/8 GB and the GMB unit is 1 GB. I have added that to the documentation
in patch 4.
The GMB limit defaults to the max value 4096 (bit 12 set) when a new group
is created, meaning the GMB limit does not apply by default.
When setting the limits, the same value should be set in all the
domains in the GMB control domain. Having a different value in each domain
results in unexpected behavior.
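Expressed in common units, the defaults from the documentation example work out like this (a rough shell sketch assuming the 1/8 GB and 1 GB units above; not part of the patch):

```shell
# MB schemata values are in 1/8 GB units; GMB values are in 1 GB units,
# so the raw numbers are not directly comparable.
mb_value=4096; gmb_value=2048   # defaults from the documentation example
mb_gb=$(( mb_value / 8 ))       # 4096 * 1/8 GB = 512 GB
gmb_gb=$gmb_value               # already in GB
echo "MB ceiling: ${mb_gb} GB, GMB ceiling: ${gmb_gb} GB"
```

In GB, the per-domain MB ceiling is actually below the global GMB ceiling, even though the raw schemata numbers suggest otherwise.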
>
> Another question, when setting aside possible differences between MB and GMB.
>
> I am trying to understand how user may expect to interact with these interfaces ...
>
> Consider the starting state example as below where the MB and GMB ceilings are the
> same:
>
> # cat schemata
> GMB:0=2048;1=2048;2=2048;3=2048
> MB:0=2048;1=2048;2=2048;3=2048
>
> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
> MB limit:
>
> # echo "GMB:0=8;2=8" > schemata
> # cat schemata
> GMB:0=8;1=2048;2=8;3=2048
> MB:0=8;1=2048;2=8;3=2048
Yes. That is correct. It will cap the MB setting to 8. Note that we
are ignoring the unit differences here to keep the example simple.
> ... and then when user space resets GMB the MB can reset like ...
>
> # echo "GMB:0=2048;2=2048" > schemata
> # cat schemata
> GMB:0=2048;1=2048;2=2048;3=2048
> MB:0=2048;1=2048;2=2048;3=2048
>
> if I understand correctly this will only apply if the MB limit was never set so
> another scenario may be to keep a previous MB setting after a GMB change:
>
> # cat schemata
> GMB:0=2048;1=2048;2=2048;3=2048
> MB:0=8;1=2048;2=8;3=2048
>
> # echo "GMB:0=8;2=8" > schemata
> # cat schemata
> GMB:0=8;1=2048;2=8;3=2048
> MB:0=8;1=2048;2=8;3=2048
>
> # echo "GMB:0=2048;2=2048" > schemata
> # cat schemata
> GMB:0=2048;1=2048;2=2048;3=2048
> MB:0=8;1=2048;2=8;3=2048
>
> What would be most intuitive way for user to interact with the interfaces?
I see that you are trying to display the effective behaviors above.
Please keep in mind that MB and GMB units differ. I recommend showing
only the values the user has explicitly configured, rather than the
effective settings, as displaying both may cause confusion.
We also need to track the previous settings so we can revert to the
earlier value when needed. The best approach is to document this
behavior clearly.
>
>>> From the description it sounds as though there is a new "memory bandwidth
>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>> GMBA allocations while the proposed user interface present them as independent.
>>>
>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>> enumerated separately, under which scenario will GMBA and MBA support different
>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>> I can see the following scenarios where MBA and GMBA can operate independently:
>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>> I hope this clarifies your question.
> No. When enumerating the features the number of CLOSID supported by each is
> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
No. There is no such scenario.
>
> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
> scenarios where some resource groups can support global AND per-domain limits while other
> resource groups can just support global or just support per-domain limits. Is this correct?
The system can support up to 16 CLOSIDs. All of them support all the
features: LLC, MB, GMB, SMBA. Yes, we have separate enumeration for
each feature. Are you suggesting changing it?
>
>>> can be seen as a single "resource" that can be allocated differently based on
>>> the various schemata associated with that resource. This currently has a
>>> dependency on the various schemata supporting the same number of CLOSID which
>>> may be something that we can reconsider?
>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
> The new approach is not final so please provide feedback to help improve it so
> that the features you are enabling can be supported well.
Yes, I am trying. I noticed that the proposal appears to affect how the
schemata information is displayed (in the info directory). It seems to
introduce additional resource information. I don't see any harm in
displaying it if it benefits certain architectures.
Thanks
Babu
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/d58f70592a4ce89e744e7378e49d5a36be3fd05e.1769029977.git.babu.moger@amd.com/
> [2] https://lore.kernel.org/lkml/e0c79c53-489d-47bf-89b9-f1bb709316c6@amd.com/
>
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-11 16:40 ` Ben Horgan
2026-02-11 19:46 ` Luck, Tony
@ 2026-02-11 22:22 ` Reinette Chatre
2026-02-12 13:55 ` Ben Horgan
1 sibling, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-11 22:22 UTC (permalink / raw)
To: Ben Horgan
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Ben,
On 2/11/26 8:40 AM, Ben Horgan wrote:
> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>> On 2/10/26 8:17 AM, Reinette Chatre wrote:
>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>
>>>>
>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>> Babu,
>>>>>>
>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>
>>>>>> Some useful additions to your explanation.
>>>>>>
>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>
>>>>> Yes. Correct.
>>>
>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>>>
>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>> number of use cases that can be supported. Consider, for example, an existing
>>> "high priority" resource group and a "low priority" resource group. The user may
>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>> cache may need more care, but if, for example, user is only interested in memory
>>> bandwidth allocation this seems a reasonable use case?
>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>> capable of in terms of number of different control groups/CLOSID that can be
>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>> example, create a resource group that contains tasks of interest and create
>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>> This will give user space better insight into system behavior and from what I can
>>> tell is supported by the feature but not enabled?
>>>
>>>>>
>>>>>> 2) It can't be the root/default group
>>>>>
>>>>> This is something I added to keep the default group in a un-disturbed,
>>>
>>> Why was this needed?
>>>
>>>>>
>>>>>> 3) It can't have sub monitor groups
>>>
>>> Why not?
>>>
>>>>>> 4) It can't be pseudo-locked
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>> need to change.
>>>>>
>>>>> Yes. That can be one use case.
>>>>>
>>>>>>
>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>> do:
>>>>>>
>>>>>> # echo '*' > tasks
>>>
>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>> complications since this designation makes resource group behave differently and
>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>>>
>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>> resource group to manage user space and kernel space allocations while also supporting
>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>> use case where user space can create a new resource group with certain allocations but the
>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>> the resource group's allocations when in CPL0.
>
> If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
That is reasonable, yes.
>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>> instead of CPL0 using something like "kernel" or ... ?
>
> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
> internally and here are a few thoughts.
>
> If the use case is just an option to run all tasks with the same closid/rmid
> (partid/pmg) configuration when they are running in the kernel, then I'd favour a
> mount option. The resctrl filesystem interface doesn't need to change and
I view mount options as an interface of last resort. Why would a mount option be needed
in this case? The existence of the file used to configure the feature seems sufficient?
Also ...
I do not think resctrl should unnecessarily place constraints on what the hardware
features are capable of. As I understand, both PLZA and MPAM supports use case where
tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
This may be because I am not familiar with all the requirements here so please do
help with insight on how the hardware feature is intended to be used as it relates
to its design.
We have to be very careful when constraining a feature this much. If resctrl does something
like this it essentially restricts what users could do forever.
> userspace software doesn't need to change. This could either take away a
> closid/rmid from userspace and dedicate it to the kernel or perhaps have a
> policy to have the default group as the kernel group. If you use the default
Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
between user space and kernel. I do not see a motivation for resctrl to place such
constraint.
> configuration, at least for MPAM, the kernel may not be running at the highest
> priority as a minimum bandwidth can be used to give a priority boost. (Once we
> have a resctrl schema for this.)
>
> It could be useful to have something a bit more featureful though. Is there a
> need for the two mappings, task->cpl0 config and task->cpl1 config, to be independent or
> would a task->(cpl0 config, cpl1 config) be sufficient? It seems awkward that
> it's not a single write to move a task. If a single mapping is sufficient, then
Moving a task in x86 is currently two writes by writing the CLOSID and RMID separately.
I think the MPAM approach is better and there may be opportunity to do this in a similar
way and both architectures use the same field(s) in the task_struct.
> as a single new file, kernel_group, per CTRL_MON group (maybe MON groups) as
> suggested above, but rather than a task that file could hold a path to the
> CTRL_MON/MON group that provides the kernel configuration for tasks running in
> that group. So that this can be transparent to existing software an empty string
Something like this would force all tasks of a group to run with the same CLOSID/RMID
(PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
and may reduce the possible use case of this feature.
For example,
- There may be a scenario where there is a set of tasks with a particular allocation
when running in user space but when in kernel these tasks benefit from different
allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
user space with allocations from resource_groupA. While these tasks are ok with this
allocation when in user space they have different requirements when it comes to
kernel space. There may be a resource_groupB that allocates a lot of resources ("high
priority") that task 1 should use for kernel work and a resource_groupC that allocates
fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
resource_groupA:
schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
tasks when in user space: 1, 2, 3
resource_groupB:
schemata: <high priority allocations>
tasks when in kernel space: 1
resource_groupC:
schemata: <medium priority allocations>
tasks when in kernel space: 2, 3
If user space is forced to give the same tasks the same user space and kernel
allocations then that will force user space to create additional resource groups that
will use up CLOSIDs/PARTIDs, which are a scarce resource.
- There may be a scenario where the user is attempting to understand system behavior by
monitoring individual or subsets of tasks' bandwidth usage when in kernel space.
- From what I can tell PLZA also supports *different* allocations when in user vs
kernel space while using the *same* monitoring group for both. This does not seem
transferable to MPAM and would take more effort to support in resctrl but it is
a use case that the hardware enables.
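As a sketch of how the first arrangement above might look with a hypothetical "tasks_cpl0" file (a proposal in this thread, not an existing resctrl interface; a scratch directory stands in for /sys/fs/resctrl so the commands are runnable):

```shell
# Mock resctrl layout: tasks 1, 2, 3 share resource_groupA's user-space
# allocations, while their kernel-space (CPL0) allocations come from
# resource_groupB ("high priority") and resource_groupC ("medium priority").
R=$(mktemp -d)
mkdir -p "$R/resource_groupA" "$R/resource_groupB" "$R/resource_groupC"
printf '1\n2\n3\n' > "$R/resource_groupA/tasks"       # user-space membership
printf '1\n'       > "$R/resource_groupB/tasks_cpl0"  # task 1 in kernel space
printf '2\n3\n'    > "$R/resource_groupC/tasks_cpl0"  # tasks 2, 3 in kernel space
cat "$R/resource_groupB/tasks_cpl0"
rm -rf "$R"
```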
When enabling a feature I would of course prefer not to add unnecessary complexity. Even so,
resctrl is expected to expose hardware capabilities to user space. There seems to be some
opinions on how user space will now and forever interact with these features that
are not clear to me so I would appreciate more insight in why these constraints are
appropriate.
Reinette
> can mean use the current group's when in the kernel (as well as for
> userspace). A slash, /, could be used to refer to the default group. This would
> give something like the below under /sys/fs/resctrl.
>
> .
> ├── cpus
> ├── tasks
> ├── ctrl1
> │ ├── cpus
> │ ├── kernel_group -> mon_groups/mon1
> │ └── tasks
> ├── kernel_group -> ctrl1
> └── mon_groups
> └── mon1
> ├── cpus
> ├── kernel_group -> ctrl1
> └── tasks
>
>>
>> I have not read anything about the RISC-V side of this yet.
>>
>> Reinette
>>
>>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>
>
> Thanks,
>
> Ben
* Re: [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group
2026-02-10 0:05 ` Reinette Chatre
@ 2026-02-11 23:10 ` Moger, Babu
0 siblings, 0 replies; 114+ messages in thread
From: Moger, Babu @ 2026-02-11 23:10 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/9/2026 6:05 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/21/26 1:12 PM, Babu Moger wrote:
>> +static ssize_t rdtgroup_plza_write(struct kernfs_open_file *of, char *buf,
>> + size_t nbytes, loff_t off)
>> +{
>> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
>
> Hardcoding PLZA configuration to the L3 resource is unexpected, especially since
> PLZA's impact and configuration on MBA is mentioned a couple of times in this
> series and discussions that followed. There also does not seem to be any
> "per resource" PLZA capability but instead when system supports PLZA
> RDT_RESOURCE_L2, RDT_RESOURCE_L3, and RDT_RESOURCE_MBA are automatically (if
> resources are present) set to support it.
Yes. That is correct. If the system supports PLZA, it applies to all the
resources.
>
> From what I understand PLZA enables user space to configure CLOSID and RMID
> used in CPL=0 independent from resource. That is, when a user configures
> PLZA with this interface all allocation information for all resources in
> resource group's schemata applies.
>
> Since this implementation makes "plza" a per-resource property it makes possible
> scenarios where some resources support plza while others do not. From what I
> can tell this is not reflected by the schemata file associated with a
> "plza" resource group that continues to enable user space to change
> allocations of all resources, whether they support plza or not.
>
> Why was PLZA determined to be a per-resource property? It instead seems to
There is no specific reason; it seemed easy to access the property that
way. Will change it.
> have larger scope? The cycle introduced in patch #9 where the arch sets
> a per-'resctrl fs' resource property and then forces resctrl fs to query
> the arch for its own property seems unnecessary. Could this support just
> be a global property that resctrl fs can query from the arch?
Yes. That seems like a better approach.
>
>> + struct rdtgroup *rdtgrp, *prgrp;
>> + int cpu, ret = 0;
>> + bool enable;
>> +
>> + ret = kstrtobool(buf, &enable);
>> + if (ret)
>> + return ret;
>> +
>> + rdtgrp = rdtgroup_kn_lock_live(of->kn);
>> + if (!rdtgrp) {
>> + rdtgroup_kn_unlock(of->kn);
>> + return -ENOENT;
>> + }
>> +
>> + rdt_last_cmd_clear();
>> +
>> + if (!r->plza_capable) {
>> + rdt_last_cmd_puts("PLZA is not supported in the system\n");
>> + ret = -EINVAL;
>> + goto unlock;
>> + }
>> +
>> + if (rdtgrp == &rdtgroup_default) {
>> + rdt_last_cmd_puts("Cannot set PLZA on a default group\n");
>> + ret = -EINVAL;
>> + goto unlock;
>> + }
>> +
>> + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
>> + rdt_last_cmd_puts("Resource group is pseudo-locked\n");
>> + ret = -EINVAL;
>> + goto unlock;
>> + }
>> +
>> + if (!list_empty(&rdtgrp->mon.crdtgrp_list)) {
>> + rdt_last_cmd_puts("Cannot change CTRL_MON group with sub monitor groups\n");
>> + ret = -EINVAL;
>> + goto unlock;
>> + }
>
> From what I can tell it is still possible to add monitor groups after a
> CTRL_MON group is designated "plza".
>
Good point. I missed it.
> If repurposing a CTRL_MON group to operate with different constraints we should
> take care how user can still continue to interact with existing files/directories
> as a group transitions between plza and non-plza. One option could be to hide files
> as needed to prevent user from interacting with them, another option needs to add
> extra checks on all the paths that interact with these files and directories.
Yes. Adding extra checks seemed like a good idea. Will do.
Thanks
Babu
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-11 21:18 ` Babu Moger
@ 2026-02-12 3:51 ` Reinette Chatre
2026-02-12 19:09 ` Babu Moger
` (3 more replies)
0 siblings, 4 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-12 3:51 UTC (permalink / raw)
To: Babu Moger, Moger, Babu, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 2/11/26 1:18 PM, Babu Moger wrote:
> On 2/11/26 10:54, Reinette Chatre wrote:
>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>>> On AMD systems, the existing MBA feature allows the user to set a bandwidth
>>>>> limit for each QOS domain. However, multiple QOS domains share system
>>>>> memory bandwidth as a resource. In order to ensure that system memory
>>>>> bandwidth is not over-utilized, user must statically partition the
>>>>> available system bandwidth between the active QOS domains. This typically
>>>> How do you define "active" QoS Domain?
>>> Some domains may not have any CPUs associated with that CLOSID. By "active", I'm referring to domains that have CPUs assigned to the CLOSID.
>> To confirm, is this then specific to assigning CPUs to resource groups via
>> the cpus/cpus_list files? This refers to how a user needs to partition
>> available bandwidth so I am still trying to understand the message here since
>> users still need to do this even when CPUs are not assigned to resource
>> groups.
>>
> It is not specific to CPU assignment. It applies to task assignment also.
>
> For example: We have 4 domains;
>
> # cat schemata
> MB:0=8192;1=8192;2=8192;3=8192
>
> If this group has CPUs assigned to only the first two domains, then the group has only two active domains and we will only update the first two domains. The MB values in the other domains do not matter.
I see, thank you. As I understand an "active QoS domain" is something only user
space can designate. It may be possible for resctrl to get a sense of which QoS domains
are "active" when only CPUs are assigned to a resource group but when it comes to task
assignment it is user space that controls where tasks belonging to a group can be
scheduled and thus which QoS domains are "active" or not.
>
> #echo "MB:0=8;1=8" > schemata
>
> # cat schemata
> MB:0=8;1=8;2=8192;3=8192
>
> The combined bandwidth can go up to 16 (8+8) units. Each unit is 1/8 GB.
>
> With GMBA, we can set the combined limit to a higher level and the total bandwidth will not exceed the GMBA limit.
Thank you for the confirmation.
>
>>>>> results in system memory being under-utilized since not all QOS domains are
>>>>> using their full bandwidth allocation.
>>>>>
>>>>> AMD PQoS Global Bandwidth Enforcement (GLBE) provides a mechanism
>>>>> for software to specify bandwidth limits for groups of threads that span
>>>>> multiple QoS Domains. This collection of QoS domains is referred to as the
>>>>> GLBE control domain. The GLBE ceiling sets a maximum limit on memory bandwidth
>>>>> in the GLBE control domain. Bandwidth is shared by all threads in a Class of
>>>>> Service (COS) across every QoS domain managed by the GLBE control domain.
>>>> How does this bandwidth allocation limit impact existing MBA? For example, if a
>>>> system has two domains (A and B) that user space separately sets MBA
>>>> allocations for while also placing both domains within a "GLBE control domain"
>>>> with a different allocation, does the individual MBA allocations still matter?
>>> Yes. Both ceilings are enforced at their respective levels.
>>> The MBA ceiling is applied at the QoS domain level.
>>> The GLBE ceiling is applied at the GLBE control domain level.
>>> If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
>> It sounds as though MBA and GMBA/GLBE operates within the same parameters wrt
>> the limits but in examples in this series they have different limits. For example,
>> in the documentation patch [1] there is this:
>>
>> # cat schemata
>> GMB:0=2048;1=2048;2=2048;3=2048
>> MB:0=4096;1=4096;2=4096;3=4096
>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>
>> followed up with what it will look like in new generation [2]:
>>
>> GMB:0=4096;1=4096;2=4096;3=4096
>> MB:0=8192;1=8192;2=8192;3=8192
>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>
>> In both examples the per-domain MB ceiling is higher than the global GMB ceiling. With
>> above showing defaults and you state "If the MBA ceiling exceeds the GLBE ceiling,
>> the effective MBA limit will be capped by the GLBE ceiling." - does this mean that
>> MB ceiling can never be higher than GMB ceiling as shown in the examples?
>
> That is correct. There is one more piece of information here: the MB unit is 1/8 GB and the GMB unit is 1 GB. I have added that to the documentation in patch 4.
ah - right. I did not take the different units into account.
>
> The GMB limit defaults to the max value 4096 (bit 12 set) when a new group is created, meaning the GMB limit does not apply by default.
>
> When setting the limits, the same value should be set in all the domains in the GMB control domain. Having a different value in each domain results in unexpected behavior.
>
>>
>> Another question, when setting aside possible differences between MB and GMB.
>>
>> I am trying to understand how user may expect to interact with these interfaces ...
>>
>> Consider the starting state example as below where the MB and GMB ceilings are the
>> same:
>>
>> # cat schemata
>> GMB:0=2048;1=2048;2=2048;3=2048
>> MB:0=2048;1=2048;2=2048;3=2048
>>
>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>> MB limit:
>> # echo "GMB:0=8;2=8" > schemata
>> # cat schemata
>> GMB:0=8;1=2048;2=8;3=2048
>> MB:0=8;1=2048;2=8;3=2048
>
> Yes. That is correct. It will cap the MB setting to 8. Note that we are ignoring the unit differences here to keep the example simple.
Thank you for confirming.
>
>
>> ... and then when user space resets GMB the MB can reset like ...
>>
>> # echo "GMB:0=2048;2=2048" > schemata
>> # cat schemata
>> GMB:0=2048;1=2048;2=2048;3=2048
>> MB:0=2048;1=2048;2=2048;3=2048
>>
>> if I understand correctly this will only apply if the MB limit was never set so
>> another scenario may be to keep a previous MB setting after a GMB change:
>>
>> # cat schemata
>> GMB:0=2048;1=2048;2=2048;3=2048
>> MB:0=8;1=2048;2=8;3=2048
>>
>> # echo "GMB:0=8;2=8" > schemata
>> # cat schemata
>> GMB:0=8;1=2048;2=8;3=2048
>> MB:0=8;1=2048;2=8;3=2048
>>
>> # echo "GMB:0=2048;2=2048" > schemata
>> # cat schemata
>> GMB:0=2048;1=2048;2=2048;3=2048
>> MB:0=8;1=2048;2=8;3=2048
>>
>> What would be most intuitive way for user to interact with the interfaces?
>
> I see that you are trying to display the effective behaviors above.
Indeed. My goal is to get an idea of how user space may interact with the new interfaces and
what a reasonable expectation from resctrl would be during these interactions.
>
> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
hmmm ... this may be subjective. Could you please elaborate how presenting the effective
settings may cause confusion?
>
> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
Yes, this will require resctrl to maintain more state.
Documenting behavior is an option but I think we should first consider if there are things
resctrl can do to make the interface intuitive to use.
>>>> From the description it sounds as though there is a new "memory bandwidth
>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>> GMBA allocations while the proposed user interface present them as independent.
>>>>
>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>> I hope this clarifies your question.
>> No. When enumerating the features the number of CLOSID supported by each is
>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
> No. There is no such scenario.
>>
>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>> scenarios where some resource groups can support global AND per-domain limits while other
>> resource groups can just support global or just support per-domain limits. Is this correct?
>
> The system can support up to 16 CLOSIDs. All of them support all the features: LLC, MB, GMB, SMBA. Yes, we have separate enumeration for each feature. Are you suggesting we change it?
It is not a concern to have different CLOSIDs between resources that are actually different,
for example, having LLC or MB support a different number of CLOSIDs. Having the possibility to
allocate the *same* resource (memory bandwidth) with a varying number of CLOSIDs does present a
challenge though. Would it be possible to have a snippet in the spec that explicitly states
that MB and GMB will always enumerate with the same number of CLOSIDs?
Please see below where I will try to support this request more clearly and you can decide if
it is reasonable.
>>>> can be seen as a single "resource" that can be allocated differently based on
>>>> the various schemata associated with that resource. This currently has a
>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>> may be something that we can reconsider?
> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
>> The new approach is not final so please provide feedback to help improve it so
>> that the features you are enabling can be supported well.
>
> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits certain architectures.
It benefits all architectures.
There are two parts to the current proposals.
Part 1: Generic schema description
I believe there is consensus on this approach. This is actually something that is long
overdue and something like this would have been great to have with the initial AMD
enabling. With the generic schema description forming part of resctrl the user can learn
from resctrl how to interact with the schemata file instead of relying on external information
and documentation.
For example, on an Intel system that uses percentage based proportional allocation for memory
bandwidth the new resctrl files will display:
info/MB/resource_schemata/MB/type:scalar linear
info/MB/resource_schemata/MB/unit:all
info/MB/resource_schemata/MB/scale:1
info/MB/resource_schemata/MB/resolution:100
info/MB/resource_schemata/MB/tolerance:0
info/MB/resource_schemata/MB/max:100
info/MB/resource_schemata/MB/min:10
On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
info/MB/resource_schemata/MB/type:scalar linear
info/MB/resource_schemata/MB/unit:GBps
info/MB/resource_schemata/MB/scale:1
info/MB/resource_schemata/MB/resolution:8
info/MB/resource_schemata/MB/tolerance:0
info/MB/resource_schemata/MB/max:2048
info/MB/resource_schemata/MB/min:1
Having such an interface will be helpful today. Users do not need to first figure out
whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
before interacting with resctrl. resctrl will be the generic interface it intends to be.
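As a sketch of how user space could consume the proposed description files (the directory
layout and file names match the proposal above; the helper functions themselves are
hypothetical):

```python
import os

# The per-schema description files proposed above.
SCHEMA_KEYS = ("type", "unit", "scale", "resolution", "tolerance", "max", "min")

def read_schema_desc(path):
    """Read the description files from an
    info/<resource>/resource_schemata/<schema>/ directory."""
    desc = {}
    for name in SCHEMA_KEYS:
        with open(os.path.join(path, name)) as f:
            value = f.read().strip()
        # Numeric files become ints, others (type, unit) stay strings.
        desc[name] = int(value) if value.lstrip("-").isdigit() else value
    return desc

def in_range(desc, value):
    # Only the range check is shown here; the proposal's exact semantics
    # for "resolution" and "tolerance" would refine this further.
    return desc["min"] <= value <= desc["max"]
```

With this, a tool can validate a requested bandwidth value against the advertised
min/max without knowing in advance whether it runs on an AMD or Intel system.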
Part 2: Supporting multiple controls for a single resource
This is a new feature, on which there also appears to be consensus, that is needed by MPAM and
Intel RDT where it is possible to use different controls for the same resource. For example,
there can be a minimum and a maximum control associated with the memory bandwidth resource.
For example,
info/
└─ MB/
└─ resource_schemata/
├─ MB/
├─ MB_MIN/
├─ MB_MAX/
┆
Here is where the big question comes in for GLBE - is this actually a new resource
for which resctrl needs to add interfaces to manage its allocation, or is it instead
an additional control associated with the existing memory bandwidth resource?
For me things are actually pointing to GLBE not being a new resource but instead being
a new control for the existing memory bandwidth resource.
I understand that for a PoC it is simplest to add support for GLBE as a new resource, as is
done in this series, but treating GLBE as an actual unique resource does not seem
appropriate since resctrl already has a "memory bandwidth" resource. User space expects
to find all the resources that it can allocate in info/ - I do not think it is correct
to have two separate directories/resources for memory bandwidth here.
What if, instead, it looks something like:
info/
└── MB/
└── resource_schemata/
├── GMB/
│ ├── max:4096
│ ├── min:1
│ ├── resolution:1
│ ├── scale:1
│ ├── tolerance:0
│ ├── type:scalar linear
│ └── unit:GBps
└── MB/
├── max:8192
├── min:1
├── resolution:8
├── scale:1
├── tolerance:0
├── type:scalar linear
└── unit:GBps
With an interface like above GMB is just another control/schema used to allocate the
existing memory bandwidth resource. With the planned files it is possible to express the
different maximums and units used by the MB and GMB schema. Users no longer need to
dig for the unit information in the docs, it is available in the interface.
Doing something like this does depend on GLBE supporting the same number of CLOSIDs
as MB, which seems to be how this will be implemented. If there is indeed a confirmation
of this from AMD architecture then we can do something like this in resctrl.
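With a layout like this, discovering the controls that allocate the memory bandwidth
resource becomes a simple directory listing (a sketch against the proposed, not yet
existing, hierarchy):

```python
import os

def list_controls(info_dir, resource="MB"):
    """Return the schemata (controls) available for a resource, e.g.
    ["GMB", "MB"] for the layout proposed above."""
    base = os.path.join(info_dir, resource, "resource_schemata")
    return sorted(entry for entry in os.listdir(base)
                  if os.path.isdir(os.path.join(base, entry)))
```

User space then learns from the interface itself that memory bandwidth is one resource
with two controls, rather than discovering two seemingly unrelated resources.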
There is a "part 3" to the proposals that attempts to address the new requirement where
some of the controls allocate at a different scope while also requiring monitoring at
that new scope. After learning more about GLBE this does not seem relevant to GLBE but is
something to return to for the "MPAM CPU-less" work. We could already prepare for this
by adding the new "scope" schema property though.
Reinette
>
> Thanks
>
> Babu
>
>
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/d58f70592a4ce89e744e7378e49d5a36be3fd05e.1769029977.git.babu.moger@amd.com/
>> [2] https://lore.kernel.org/lkml/e0c79c53-489d-47bf-89b9-f1bb709316c6@amd.com/
>>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-01-28 16:01 ` Moger, Babu
2026-01-28 17:12 ` Luck, Tony
@ 2026-02-12 10:00 ` Ben Horgan
1 sibling, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-12 10:00 UTC (permalink / raw)
To: Moger, Babu, Luck, Tony, Babu Moger
Cc: corbet, reinette.chatre, Dave.Martin, james.morse, tglx, mingo,
bp, dave.hansen, x86, hpa, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, akpm,
pawan.kumar.gupta, pmladek, feng.tang, kees, arnd, fvdl,
lirongqing, bhelgaas, seanjc, xin, manali.shukla, dapeng1.mi,
chang.seok.bae, mario.limonciello, naveen, elena.reshetova,
thomas.lendacky, linux-doc, linux-kernel, kvm, peternewman,
eranian, gautham.shenoy
Hi Babu,
On 1/28/26 16:01, Moger, Babu wrote:
> Hi Tony,
>
> Thanks for the comment.
>
> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>> On Wed, Jan 21, 2026 at 03:12:51PM -0600, Babu Moger wrote:
>>> @@ -138,6 +143,20 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
>>> state->cur_rmid = rmid;
>>> wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
>>> }
>>> +
>>> + if (static_branch_likely(&rdt_plza_enable_key)) {
>>> + tmp = READ_ONCE(tsk->plza);
>>> + if (tmp)
>>> + plza = tmp;
>>> +
>>> + if (plza != state->cur_plza) {
>>> + state->cur_plza = plza;
>>> + wrmsr(MSR_IA32_PQR_PLZA_ASSOC,
>>> + RMID_EN | state->plza_rmid,
>>> + (plza ? PLZA_EN : 0) | CLOSID_EN | state->plza_closid);
>>> + }
>>> + }
>>> +
>>
>> Babu,
>>
>> This addition to the context switch code surprised me. After your talk
>> at LPC I had imagined that PLZA would be a single global setting so that
>> every syscall/page-fault/interrupt would run with a different CLOSID
>> (presumably one configured with more cache and memory bandwidth).
>>
>> But this patch series looks like things are more flexible with the
>> ability to set different values (of RMID as well as CLOSID) per group.
>
> Yes. This is similar to what we have with MSR_IA32_PQR_ASSOC. The association
> can be done either through CPUs (just one MSR write) or task-based
> association (more MSR writes as the task moves around).
>>
>> It looks like it is possible to have some resctrl group with very
>> limited resources just bump up a bit when in ring0, while other
>> groups may get some different amount.
>>
>> The additions for plza to the Documentation aren't helping me
>> understand how users will apply this.
>>
>> Do you have some more examples?
>
> Group creation is similar to what we have currently.
>
> 1. create a regular group and setup the limits.
> # mkdir /sys/fs/resctrl/group
>
> 2. Assign tasks or CPUs.
> # echo 1234 > /sys/fs/resctrl/group/tasks
>
> This is a regular group.
>
> 3. Now you figured that you need to change things in CPL0 for this task.
>
> 4. Create a PLZA group and tweak the limits:
>
> # mkdir /sys/fs/resctrl/group1
>
> # echo 1 > /sys/fs/resctrl/group1/plza
>
> # echo "MB:0=100" > /sys/fs/resctrl/group1/schemata
>
> 5. Assign the same task to the plza group.
>
> # echo 1234 > /sys/fs/resctrl/group1/tasks
Reusing the 'tasks' files for kernel configuration risks confusing existing
user space tools that don't know about the new plza option. For example, this
may be a problem if the user manually set plza and then tried to use their
existing tools to inspect or configure resctrl settings.
>
>
> Now the task 1234 will be using the limits from group1 when running in
> CPL0.
>
> I will add a few more details in my next revision.
>
> Thanks
> Babu
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-11 22:22 ` Reinette Chatre
@ 2026-02-12 13:55 ` Ben Horgan
2026-02-12 18:37 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-12 13:55 UTC (permalink / raw)
To: Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Reinette, Tony, Babu,
On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/11/26 8:40 AM, Ben Horgan wrote:
> > On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
> >> On 2/10/26 8:17 AM, Reinette Chatre wrote:
> >>> On 1/28/26 9:44 AM, Moger, Babu wrote:
> >>>>
> >>>>
> >>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
> >>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> >>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
> >>>>>> Babu,
> >>>>>>
> >>>>>> I've read a bit more of the code now and I think I understand more.
> >>>>>>
> >>>>>> Some useful additions to your explanation.
> >>>>>>
> >>>>>> 1) Only one CTRL group can be marked as PLZA
> >>>>>
> >>>>> Yes. Correct.
> >>>
> >>> Why limit it to one CTRL_MON group and why not support it for MON groups?
> >>>
> >>> Limiting it to a single CTRL group seems restrictive in a few ways:
> >>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> >>> number of use cases that can be supported. Consider, for example, an existing
> >>> "high priority" resource group and a "low priority" resource group. The user may
> >>> just want to let the tasks in the "low priority" resource group run as "high priority"
> >>> when in CPL0. This of course may depend on what resources are allocated, for example
> >>> cache may need more care, but if, for example, user is only interested in memory
> >>> bandwidth allocation this seems a reasonable use case?
> >>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> >>> capable of in terms of number of different control groups/CLOSID that can be
> >>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> >>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> >>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> >>> example, create a resource group that contains tasks of interest and create
> >>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> >>> This will give user space better insight into system behavior and from what I can
> >>> tell is supported by the feature but not enabled?
> >>>
> >>>>>
> >>>>>> 2) It can't be the root/default group
> >>>>>
> >>>>> This is something I added to keep the default group in a un-disturbed,
> >>>
> >>> Why was this needed?
> >>>
> >>>>>
> >>>>>> 3) It can't have sub monitor groups
> >>>
> >>> Why not?
> >>>
> >>>>>> 4) It can't be pseudo-locked
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>>
> >>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
> >>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
> >>>>>> need to change.
> >>>>>
> >>>>> Yes. That can be one use case.
> >>>>>
> >>>>>>
> >>>>>> If that is the case, maybe for the PLZA group we should allow user to
> >>>>>> do:
> >>>>>>
> >>>>>> # echo '*' > tasks
> >>>
> >>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
> >>> complications since this designation makes resource group behave differently and
> >>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
As I commented on another thread, I'm wary of this reuse of existing file types
as it can confuse existing user-space tools.
> >>>
> >>> I am wondering if it will not be simpler to introduce just one new file, for example
> >>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> >>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> >>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> >>> resource group to manage user space and kernel space allocations while also supporting
> >>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> >>> use case where user space can create a new resource group with certain allocations but the
> >>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> >>> the resource group's allocations when in CPL0.
> >
> > If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
>
> That is reasonable, yes.
I think the "tasks_cpl0" approach suffers from one of the same faults as the
"kernel_groups" approach. If you want a task to run with user space configuration
closid-A/rmid-Y, but in kernel space with closid-B and the same rmid-Y, then
monitor groups for both cannot exist in resctrl.
>
> >> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
> >> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
> >> instead of CPL0 using something like "kernel" or ... ?
> >
> > Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
> > internally and here are a few thoughts.
> >
> > If the use case is just an option to run all tasks with the same closid/rmid
> > (partid/pmg) configuration when they are running in the kernel then I'd favour a
> > mount option. The resctrl filesystem interface doesn't need to change and
>
> I view mount options as an interface of last resort. Why would a mount option be needed
> in this case? The existence of the file used to configure the feature seems sufficient?
If we are taking away a closid from the user then the number of CTRL_MON groups
that can be created changes. It seems reasonable for user-space to expect
num_closid to be a fixed value.
>
> Also ...
>
> I do not think resctrl should unnecessarily place constraints on what the hardware
> features are capable of. As I understand, both PLZA and MPAM supports use case where
> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
> This may be because I am not familiar with all the requirements here so please do
> help with insight on how the hardware feature is intended to be used as it relates
> to its design.
>
> We have to be very careful when constraining a feature this much. If resctrl does something
> like this it essentially restricts what users could do forever.
Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
fixed kernel CLOSID/RMID configuration option might just give all we need for
usecases we know we have and be minimally intrusive enough to not preclude a
more featureful PLZA later when new usecases come about.
One complication with the fixed kernel CLOSID/RMID option is that for x86 you
may want to be able to monitor a task's resource usage whether it is in the
kernel or in userspace, and so fix only the CLOSID. However, for MPAM this
wouldn't work as PMG (~RMID) is scoped to PARTID (~CLOSID).
>
> > userspace software doesn't need to change. This could either take away a
> > closid/rmid from userspace and dedicate it to the kernel or perhaps have a
> > policy to have the default group as the kernel group. If you use the default
>
> Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
> between user space and kernel. I do not see a motivation for resctrl to place such
> constraint.
>
> > configuration, at least for MPAM, the kernel may not be running at the highest
> > priority as a minimum bandwidth can be used to give a priority boost. (Once we
> > have a resctrl schema for this.)
> >
> > It could be useful to have something a bit more featureful though. Is there a
> > need for the two mappings, task->cpl0 config and task->cpl1 config, to be independent, or
> > would a single task->(cpl0 config, cpl1 config) mapping be sufficient? It seems awkward that
> > it's not a single write to move a task. If a single mapping is sufficient, then
>
> Moving a task in x86 is currently two writes by writing the CLOSID and RMID separately.
> I think the MPAM approach is better and there may be opportunity to do this in a similar
> way and both architectures use the same field(s) in the task_struct.
I was referring to the userspace file write, but unifying on the same fields in
task_struct could be good. The single write is necessary for MPAM as PMG is
scoped to PARTID, and I don't think x86 behaviour changes if it moves to the same
approach.
>
> > a single new file, kernel_group, per CTRL_MON group (maybe MON groups) as
> > suggested above, but rather than a task that file could hold a path to the
> > CTRL_MON/MON group that provides the kernel configuration for tasks running in
> > that group. So that this can be transparent to existing software, an empty string
>
> Something like this would force all tasks of a group to run with the same CLOSID/RMID
> (PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
> and may reduce the possible use case of this feature.
>
> For example,
> - There may be a scenario where there is a set of tasks with a particular allocation
> when running in user space but when in kernel these tasks benefit from different
> allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
> user space with allocations from resource_groupA. While these tasks are ok with this
> allocation when in user space they have different requirements when it comes to
> kernel space. There may be a resource_groupB that allocates a lot of resources ("high
> priority") that task 1 should use for kernel work and a resource_groupC that allocates
> fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
>
> resource_groupA:
> schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
> tasks when in user space: 1, 2, 3
>
> resource_groupB:
> schemata: <high priority allocations>
> tasks when in kernel space: 1
>
> resource_groupC:
> schemata: <medium priority allocations>
> tasks when in kernel space: 2, 3
I'm not sure if this would happen in the real world or not.
>
> If user space is forced to have the same tasks have the same user space and kernel
> allocations then that will force user space to create additional resource groups that
> will use up CLOSID/PARTID that is a scarce resource.
This may be undesirable even if CLOSID/PARTID were unlimited as controls which set
a per-CLOSID/PARTID maximum don't have the same effect if the tasks are spread across
more than one CLOSID/PARTID.
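To make that point concrete with illustrative numbers: a per-CLOSID ceiling does not
compose when the same set of tasks is spread over several CLOSIDs.

```python
# Two tasks sharing one CLOSID capped at 8 GBps compete under a single
# 8 GBps ceiling; the same tasks split across two CLOSIDs, each capped
# at 8 GBps, can together consume up to 16 GBps.
ceiling_gbps = 8
aggregate_one_closid = ceiling_gbps        # tasks share one cap
aggregate_two_closids = 2 * ceiling_gbps   # the caps add up

assert aggregate_one_closid == 8
assert aggregate_two_closids == 16
```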
>
> - There may be a scenario where the user is attempting to understand system behavior by
> monitoring individual or subsets of tasks' bandwidth usage when in kernel space.
This seems useful to me.
>
> - From what I can tell PLZA also supports *different* allocations when in user vs
> kernel space while using the *same* monitoring group for both. This does not seem
> transferable to MPAM and would take more effort to support in resctrl but it is
> a use case that the hardware enables.
Ah yes, I think this ends the 'kernel_group' idea then. I was too focused on
MPAM and forgot to consider the case where PMG and PARTID are independent.
>
> When enabling a feature I would of course prefer not to add unnecessary complexity. Even so,
> resctrl is expected to expose hardware capabilities to user space. There seem to be some
> opinions on how user space will now and forever interact with these features that
> are not clear to me so I would appreciate more insight in why these constraints are
> appropriate.
Yes, care definitely needs to be taken here in order to not back ourselves into
a corner.
>
> Reinette
>
> > can mean use the current group's when in the kernel (as well as for
> > userspace). A slash, /, could be used to refer to the default group. This would
> > give something like the below under /sys/fs/resctrl.
> >
> > .
> > ├── cpus
> > ├── tasks
> > ├── ctrl1
> > │ ├── cpus
> > │ ├── kernel_group -> mon_groups/mon1
> > │ └── tasks
> > ├── kernel_group -> ctrl1
> > └── mon_groups
> > └── mon1
> > ├── cpus
> > ├── kernel_group -> ctrl1
> > └── tasks
> >
> >>
> >> I have not read anything about the RISC-V side of this yet.
> >>
> >> Reinette
> >>
> >>>
> >>> Reinette
> >>>
> >>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
> >>
> >
> > Thanks,
> >
> > Ben
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-12 13:55 ` Ben Horgan
@ 2026-02-12 18:37 ` Reinette Chatre
2026-02-16 15:18 ` Ben Horgan
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-12 18:37 UTC (permalink / raw)
To: Ben Horgan
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Ben,
On 2/12/26 5:55 AM, Ben Horgan wrote:
> On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
>> On 2/11/26 8:40 AM, Ben Horgan wrote:
>>> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>>>> On 2/10/26 8:17 AM, Reinette Chatre wrote:
>>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>>
>>>>>>
>>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>>> Babu,
>>>>>>>>
>>>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>>>
>>>>>>>> Some useful additions to your explanation.
>>>>>>>>
>>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>>
>>>>>>> Yes. Correct.
>>>>>
>>>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>>>>>
>>>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>>>> number of use cases that can be supported. Consider, for example, an existing
>>>>> "high priority" resource group and a "low priority" resource group. The user may
>>>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>>>> cache may need more care, but if, for example, user is only interested in memory
>>>>> bandwidth allocation this seems a reasonable use case?
>>>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>>>> capable of in terms of number of different control groups/CLOSID that can be
>>>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>>>> example, create a resource group that contains tasks of interest and create
>>>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>>>> This will give user space better insight into system behavior and from what I can
>>>>> tell is supported by the feature but not enabled?
>>>>>
>>>>>>>
>>>>>>>> 2) It can't be the root/default group
>>>>>>>
>>>>>>> This is something I added to keep the default group in a un-disturbed,
>>>>>
>>>>> Why was this needed?
>>>>>
>>>>>>>
>>>>>>>> 3) It can't have sub monitor groups
>>>>>
>>>>> Why not?
>>>>>
>>>>>>>> 4) It can't be pseudo-locked
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>>>> need to change.
>>>>>>>
>>>>>>> Yes. That can be one use case.
>>>>>>>
>>>>>>>>
>>>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>>>> do:
>>>>>>>>
>>>>>>>> # echo '*' > tasks
>>>>>
>>>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>>>> complications since this designation makes resource group behave differently and
>>>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>
> As I commented on another thread, I'm wary of this reuse of existing file types
> as they can confuse existing user-space tools.
I agree. Changing how user space interacts with existing files is a change that would
require a mount option and this can be avoided by using new files instead.
>>>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>>>> resource group to manage user space and kernel space allocations while also supporting
>>>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>>>> use case where user space can create a new resource group with certain allocations but the
>>>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>>>> the resource group's allocations when in CPL0.
>>>
>>> If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
>>
>> That is reasonable, yes.
>
> I think the "tasks_cpl0" approach suffers from one of the same faults as the
> "kernel_groups" approach. If you want to run a task with userspace configuration
> closid-A rmid-Y but to run in kernel space in closid-B but the same rmid-Y then
> there can't exist monitor_group in resctrl for both.
This assumes that "tasks" and "tasks_cpl0"/"tasks_kernel" have the same rules for
task assignment. When a user assigns a task to the "tasks" file of a MON group it
is required that the task is a member of the parent CTRL_MON group and if so, that
task's CLOSID and RMID are both updated. Theoretically there could be different rules
for task assignment to the "tasks_cpl0"/"tasks_kernel" file that does not place such
restriction and only updates CLOSID when moving to a CTRL_MON group and only updates
RMID when moving to a MON group.
You are correct that resctrl cannot have monitor groups to track such configuration
and there may indeed be some consequences that I have not considered.
I understand this is not something that MPAM can support and I also do not know if this
is even a valid use case. If doing something like this, user space will need to take care
since the monitoring data will be presented with the allocations used when tasks are in
user space, but will also contain the monitoring data for allocations used when tasks are in
kernel space that are tracked in another control group hierarchy (to which I expect the
task's kernel space monitoring can move when the MON group is deleted).
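The different assignment rules can be sketched as follows (a model of the hypothetical
"tasks_cpl0" semantics discussed above, not existing resctrl behavior; all names are
illustrative):

```python
class Task:
    def __init__(self):
        # user-space IDs (set via the existing "tasks" file)
        self.closid = 0
        self.rmid = 0
        # kernel-space (CPL0) IDs (set via the hypothetical "tasks_cpl0" file)
        self.kernel_closid = 0
        self.kernel_rmid = 0

def assign_cpl0(task, group_kind, closid, rmid):
    """Hypothetical assignment rule: a CTRL_MON group's tasks_cpl0 file sets
    only the task's kernel CLOSID, a MON group's sets only its kernel RMID."""
    if group_kind == "CTRL_MON":
        task.kernel_closid = closid
    elif group_kind == "MON":
        task.kernel_rmid = rmid

t = Task()
assign_cpl0(t, "CTRL_MON", closid=2, rmid=5)
assert (t.kernel_closid, t.kernel_rmid) == (2, 0)
assign_cpl0(t, "MON", closid=3, rmid=7)
assert (t.kernel_closid, t.kernel_rmid) == (2, 7)
```

This illustrates how kernel-space CLOSID and RMID could be moved independently, at the
cost of the monitoring-data caveats described above.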
>>>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>>>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>>>> instead of CPL0 using something like "kernel" or ... ?
>>>
>>> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
>>> internally and here are a few thoughts.
>>>
> >>> If the use case is just an option to run all tasks with the same closid/rmid
> >>> (partid/pmg) configuration when they are running in the kernel then I'd favour a
> >>> mount option. The resctrl filesystem interface doesn't need to change and
>>
>> I view mount options as an interface of last resort. Why would a mount option be needed
>> in this case? The existence of the file used to configure the feature seems sufficient?
>
> If we are taking away a closid from the user then the number of CTRL_MON groups
> that can be created changes. It seems reasonable for user-space to expect
> num_closid to be a fixed value.
I do not see why we need to take away a CLOSID from the user. Consider a user space that
runs with just two resource groups, for example, "high priority" and "low priority". It seems
reasonable to make it possible to let the "low priority" tasks run with "high priority"
allocations when in kernel space without needing to dedicate a new CLOSID? This is even more
reasonable when only considering memory bandwidth allocation.
>
>>
>> Also ...
>>
>> I do not think resctrl should unnecessarily place constraints on what the hardware
>> features are capable of. As I understand, both PLZA and MPAM supports use case where
>> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
>> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
>> This may be because I am not familiar with all the requirements here so please do
>> help with insight on how the hardware feature is intended to be used as it relates
>> to its design.
>>
>> We have to be very careful when constraining a feature this much. If resctrl does something
>> like this it essentially restricts what users could do forever.
>
> Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
> fixed kernel CLOSID/RMID configuration option might just give all we need for
> usecases we know we have and be minimally intrusive enough to not preclude a
> more featureful PLZA later when new usecases come about.
Having the ability to grow features would be ideal. I do not see how a fixed kernel CLOSID/RMID
configuration leaves room to build on top though. Could you please elaborate?
I wonder if the benefit of the fixed CLOSID/RMID is perhaps mostly in the cost of
context switching which I do not think is a concern for MPAM but it may be for PLZA?
One option to support fixed kernel CLOSID/RMID at the beginning and leave room to build
may be to create the kernel_group or "tasks_kernel" interface as a baseline but in the first
implementation only allow user space to write the same group to all "kernel_group" files or
to only allow to write to one of the "tasks_kernel" files in the resctrl fs hierarchy. At
that time the associated CLOSID/RMID would become the "fixed configuration" and attempts to
write to others can return "ENOSPC"?
From what I can tell this still does not require to take away a CLOSID/RMID from user space
though. Dedicating a CLOSID/RMID to kernel work can still be done but be in control of the user
who can, for example, leave the "tasks" and "cpus" files empty.
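To sketch what that baseline could look like, here is a simulation in a temp directory standing in for /sys/fs/resctrl. The "tasks_kernel" file and the single-owner/ENOSPC rule are hypothetical, per the discussion above:

```shell
# Simulation of a "only one group may claim kernel tasks" baseline.
root=$(mktemp -d)
mkdir -p "$root/groupA" "$root/groupB"
: > "$root/groupA/tasks_kernel"
: > "$root/groupB/tasks_kernel"

write_tasks_kernel() {    # write_tasks_kernel <group> <pid>
    local group=$1 pid=$2 f
    for f in "$root"/*/tasks_kernel; do
        # Another group already holds the fixed kernel association.
        if [ -s "$f" ] && [ "$f" != "$root/$group/tasks_kernel" ]; then
            echo "write to $group: ENOSPC" >&2
            return 28
        fi
    done
    echo "$pid" >> "$root/$group/tasks_kernel"
}

write_tasks_kernel groupA 1234 && echo "groupA accepted"
write_tasks_kernel groupB 5678 || echo "groupB rejected"
```

The first write succeeds and becomes the "fixed configuration"; the second is rejected, mirroring the ENOSPC behavior suggested above.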
> One complication with the fixed kernel CLOSID/RMID option is that for x86 you
> may want to be able to monitor a tasks resource usage whether or not it is in
> the kernel or userspace and so only have a fixed CLOSID. However, for MPAM this
> wouldn't work as PMG (~RMID) is scoped to PARTID (~CLOSID).
>
>>
>>> userspace software doesn't need to change. This could either take away a
>>> closid/rmid from userspace and dedicate it to the kernel or perhaps have a
>>> policy to have the default group as the kernel group. If you use the default
>>
>> Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
>> between user space and kernel. I do not see a motivation for resctrl to place such
>> constraint.
>>
>>> configuration, at least for MPAM, the kernel may not be running at the highest
>>> priority as a minimum bandwidth can be used to give a priority boost. (Once we
>>> have a resctrl schema for this.)
>>>
>>> It could be useful to have something a bit more featureful though. Is there a
>>> need for the two mappings, task->cpl0 config and task->cpl1 config, to be independent or
>>> would a task->(cpl0 config, cpl1 config) be sufficient? It seems awkward that
>>> it's not a single write to move a task. If a single mapping is sufficient, then
>>
>> Moving a task in x86 is currently two writes by writing the CLOSID and RMID separately.
>> I think the MPAM approach is better and there may be opportunity to do this in a similar
>> way and both architectures use the same field(s) in the task_struct.
>
> I was referring to the userspace file write but unifying on the same fields in
> task_struct could be good. The single write is necessary for MPAM as PMG is
> scoped to PARTID and I don't think x86 behaviour changes if it moves to the same
> approach.
>
ah - I misunderstood. You are suggesting to have one file that user writes to
to set both user space and kernel space CLOSID/RMID? This sounds like what the
existing "tasks" file does but only supports the same CLOSID/RMID for both user
space and kernel space. To support the new hardware features where the CLOSID/RMID
can be different we cannot just change the "tasks" interface and would need to keep it
backward compatible. So far I assumed that it would be ok for the "tasks" file
to essentially get a new meaning as the CLOSID/RMID for just user space work, which
seems to require a second file for kernel space as a consequence? So far I have
not seen an option that does not change the meaning of the "tasks" file.
>>> a single new file, kernel_group, per CTRL_MON group (maybe MON groups) as
>>> suggested above but rather than a task that file could hold a path to the
>>> CTRL_MON/MON group that provides the kernel configuration for tasks running in
>>> that group. So that this can be transparent to existing software an empty string
>>
>> Something like this would force all tasks of a group to run with the same CLOSID/RMID
>> (PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
>> and may reduce the possible use case of this feature.
>>
>> For example,
>> - There may be a scenario where there is a set of tasks with a particular allocation
>> when running in user space but when in kernel these tasks benefit from different
>> allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
>> user space with allocations from resource_groupA. While these tasks are ok with this
>> allocation when in user space they have different requirements when it comes to
>> kernel space. There may be a resource_groupB that allocates a lot of resources ("high
>> priority") that task 1 should use for kernel work and a resource_groupC that allocates
>> fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
>>
>> resource_groupA:
>> schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
>> tasks when in user space: 1, 2, 3
>>
>> resource_groupB:
>> schemata: <high priority allocations>
>> tasks when in kernel space: 1
>>
>> resource_groupC:
>> schemata: <medium priority allocations>
>> tasks when in kernel space: 2, 3
>
> I'm not sure if this would happen in the real world or not.
Ack. I would like to echo Tony's request for feedback from resctrl users
https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>
>>
>> If user space is forced to give the same tasks the same user space and kernel
>> allocations then that will force user space to create additional resource groups that
>> will use up CLOSID/PARTID that is a scarce resource.
>
> This may be undesirable even if CLOSID/PARTID were unlimited as controls which set
> a per-CLOSID/PARTID maximum don't have the same effect if the tasks are spread across
> more than one CLOSID/PARTID.
Thank you for bringing this up. I did not consider the mechanics of the memory bandwidth
controls.
>
>>
>> - There may be a scenario where the user is attempting to understand system behavior by
>> monitoring individual or subsets of tasks' bandwidth usage when in kernel space.
>
> This seems useful to me.
>
>>
>> - From what I can tell PLZA also supports *different* allocations when in user vs
>> kernel space while using the *same* monitoring group for both. This does not seem
>> transferable to MPAM and would take more effort to support in resctrl but it is
>> a use case that the hardware enables.
>
> Ah yes, I think this ends the 'kernel_group' idea then. I was too focused on
> MPAM and forgot to consider the case where PMG and PARTID are independent.
Of course we would want user space to have a consistent experience from resctrl no matter the
architecture so these places where architectures behave differently need more care.
>> When enabling a feature I would of course prefer not to add unnecessary complexity. Even so,
>> resctrl is expected to expose hardware capabilities to user space. There seem to be some
>> opinions on how user space will now and forever interact with these features that
>> are not clear to me so I would appreciate more insight into why these constraints are
>> appropriate.
>
> Yes, care definitely needs to be taken here in order to not back ourselves into
> a corner.
I really appreciate the discussions to help create a useful interface.
Reinette
>
>>
>> Reinette
>>
>>> can mean use the current group's when in the kernel (as well as for
>>> userspace). A slash, /, could be used to refer to the default group. This would
>>> give something like the below under /sys/fs/resctrl.
>>>
>>> .
>>> ├── cpus
>>> ├── tasks
>>> ├── ctrl1
>>> │ ├── cpus
>>> │ ├── kernel_group -> mon_groups/mon1
>>> │ └── tasks
>>> ├── kernel_group -> ctrl1
>>> └── mon_groups
>>> └── mon1
>>> ├── cpus
>>> ├── kernel_group -> ctrl1
>>> └── tasks
>>>
>>>>
>>>> I have not read anything about the RISC-V side of this yet.
>>>>
>>>> Reinette
>>>>
>>>>>
>>>>> Reinette
>>>>>
>>>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>>>
>>>
>>> Thanks,
>>>
>>> Ben
>>
>
> Thanks,
>
> Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-12 3:51 ` Reinette Chatre
@ 2026-02-12 19:09 ` Babu Moger
2026-02-13 0:05 ` Reinette Chatre
2026-02-20 10:07 ` Ben Horgan
` (2 subsequent siblings)
3 siblings, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-02-12 19:09 UTC (permalink / raw)
To: Reinette Chatre, Moger, Babu, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/11/26 21:51, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/11/26 1:18 PM, Babu Moger wrote:
>> On 2/11/26 10:54, Reinette Chatre wrote:
>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>>>> On AMD systems, the existing MBA feature allows the user to set a bandwidth
>>>>>> limit for each QOS domain. However, multiple QOS domains share system
>>>>>> memory bandwidth as a resource. In order to ensure that system memory
>>>>>> bandwidth is not over-utilized, the user must statically partition the
>>>>>> available system bandwidth between the active QOS domains. This typically
>>>>> How do you define "active" QoS Domain?
>>>> Some domains may not have any CPUs associated with that CLOSID. By "active" I'm referring to domains that have CPUs assigned to the CLOSID.
>>> To confirm, is this then specific to assigning CPUs to resource groups via
>>> the cpus/cpus_list files? This refers to how a user needs to partition
>>> available bandwidth so I am still trying to understand the message here since
>>> users still need to do this even when CPUs are not assigned to resource
>>> groups.
>>>
>> It is not specific to CPU assignment. It applies to task assignment also.
>>
>> For example: We have 4 domains;
>>
>> # cat schemata
>> MB:0=8192;1=8192;2=8192;3=8192
>>
>> If this group has the CPUs assigned to only the first two domains, then the group has only two active domains and we will only update the first two domains. The MB values in the other domains do not matter.
> I see, thank you. As I understand an "active QoS domain" is something only user
> space can designate. It may be possible for resctrl to get a sense of which QoS domains
> are "active" when only CPUs are assigned to a resource group but when it comes to task
> assignment it is user space that controls where tasks belonging to a group can be
> scheduled and thus which QoS domains are "active" or not.
Yes. In case of task assignment, it depends on where the task is
scheduled. Users (admins) normally have an idea of where to run their
workload.
>> # echo "MB:0=8;1=8" > schemata
>>
>> # cat schemata
>> MB:0=8;1=8;2=8192;3=8192
>>
>> The combined bandwidth can go up to 16(8+8) units. Each unit is 1/8 GB.
>>
>> With GMBA, we can set the combined limit at a higher level and the total bandwidth will not exceed the GMBA limit.
> Thank you for the confirmation.
>
>>>>>> results in system memory being under-utilized since not all QOS domains are
>>>>>> using their full bandwidth allocation.
>>>>>>
>>>>>> AMD PQoS Global Bandwidth Enforcement(GLBE) provides a mechanism
>>>>>> for software to specify bandwidth limits for groups of threads that span
>>>>>> multiple QoS Domains. This collection of QOS domains is referred to as GLBE
>>>>>> control domain. The GLBE ceiling sets a maximum limit on a memory bandwidth
>>>>>> in GLBE control domain. Bandwidth is shared by all threads in a Class of
>>>>>> Service(COS) across every QoS domain managed by the GLBE control domain.
>>>>> How does this bandwidth allocation limit impact existing MBA? For example, if a
>>>>> system has two domains (A and B) that user space separately sets MBA
>>>>> allocations for while also placing both domains within a "GLBE control domain"
>>>>> with a different allocation, does the individual MBA allocations still matter?
>>>> Yes. Both ceilings are enforced at their respective levels.
>>>> The MBA ceiling is applied at the QoS domain level.
>>>> The GLBE ceiling is applied at the GLBE control domain level.
>>>> If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
>>> It sounds as though MBA and GMBA/GLBE operates within the same parameters wrt
>>> the limits but in examples in this series they have different limits. For example,
>>> in the documentation patch [1] there is this:
>>>
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=4096;1=4096;2=4096;3=4096
>>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>>
>>> followed up with what it will look like in new generation [2]:
>>>
>>> GMB:0=4096;1=4096;2=4096;3=4096
>>> MB:0=8192;1=8192;2=8192;3=8192
>>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>>
>>> In both examples the per-domain MB ceiling is higher than the global GMB ceiling. With
>>> above showing defaults and you state "If the MBA ceiling exceeds the GLBE ceiling,
>>> the effective MBA limit will be capped by the GLBE ceiling." - does this mean that
>>> MB ceiling can never be higher than GMB ceiling as shown in the examples?
>> That is correct. There is one more piece of information here. The MB unit is 1/8 GB and the GMB unit is 1 GB. I have added that to the documentation in patch 4.
> ah - right. I did not take the different units into account.
>
>> The GMB limit defaults to max value 4096 (bit 12 set) when the new group is created. Meaning GMB limit does not apply by default.
>>
>> When setting the limits, the same value should be set in all the domains in the GMB control domain. Having a different value in each domain results in unexpected behavior.
>>
>>> Another question, when setting aside possible differences between MB and GMB.
>>>
>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>
>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>> same:
>>>
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=2048;1=2048;2=2048;3=2048
>>>
>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>> MB limit:
>>> # echo "GMB:0=8;2=8" > schemata
>>> # cat schemata
>>> GMB:0=8;1=2048;2=8;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>> Yes. That is correct. It will cap the MB setting to 8. Note that we are setting aside the unit differences to keep it simple.
> Thank you for confirming.
>
>>> ... and then when user space resets GMB the MB can reset like ...
>>>
>>> # echo "GMB:0=2048;2=2048" > schemata
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=2048;1=2048;2=2048;3=2048
>>>
>>> if I understand correctly this will only apply if the MB limit was never set so
>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>>>
>>> # echo "GMB:0=8;2=8" > schemata
>>> # cat schemata
>>> GMB:0=8;1=2048;2=8;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>>>
>>> # echo "GMB:0=2048;2=2048" > schemata
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>>>
>>> What would be the most intuitive way for the user to interact with the interfaces?
>> I see that you are trying to display the effective behaviors above.
> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
> what a reasonable expectation of resctrl would be during these interactions.
>
>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
> settings may cause confusion?
I mean in many cases, we cannot determine the effective settings
correctly. It depends on the benchmarks or applications running on the system.
Even with MB (without GMB support), even though we set the limit to
10GB, it may not use the whole 10GB. Memory is a shared resource. So, the
effective bandwidth usage depends on other applications running on the
system.
>> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
> Yes, this will require resctrl to maintain more state.
>
> Documenting behavior is an option but I think we should first consider if there are things
> resctrl can do to make the interface intuitive to use.
>
>>>>>> From the description it sounds as though there is a new "memory bandwidth
>>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>>> GMBA allocations while the proposed user interface present them as independent.
>>>>>
>>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>>> I hope this clarifies your question.
>>> No. When enumerating the features the number of CLOSID supported by each is
>>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
>>> No. There is no such scenario.
>>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>>> scenarios where some resource groups can support global AND per-domain limits while other
>>> resource groups can just support global or just support per-domain limits. Is this correct?
>> System can support up to 16 CLOSIDs. All of them support all the features LLC, MB, GMB, SMBA. Yes. We have separate enumeration for each feature. Are you suggesting to change it ?
> It is not a concern to have different CLOSIDs between resources that are actually different,
> for example, having LLC or MB support different number of CLOSIDs. Having the possibility to
> allocate the *same* resource (memory bandwidth) with varying number of CLOSIDs does present a
> challenge though. Would it be possible to have a snippet in the spec that explicitly states
> that MB and GMB will always enumerate with the same number of CLOSIDs?
I have confirmed that this is always the case. In all current and planned
implementations, MB and GMB will have the same number of CLOSIDs.
> Please see below where I will try to support this request more clearly and you can decide if
> it is reasonable.
>
>>>>> can be seen as a single "resource" that can be allocated differently based on
>>>>> the various schemata associated with that resource. This currently has a
>>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>>> may be something that we can reconsider?
>>>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
>>> The new approach is not final so please provide feedback to help improve it so
>>> that the features you are enabling can be supported well.
>> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits certain architectures.
> It benefits all architectures.
>
> There are two parts to the current proposals.
>
> Part 1: Generic schema description
> I believe there is consensus on this approach. This is actually something that is long
> overdue and something like this would have been great to have with the initial AMD
> enabling. With the generic schema description forming part of resctrl the user can learn
> from resctrl how to interact with the schemata file instead of relying on external information
> and documentation.
ok.
> For example, on an Intel system that uses percentage based proportional allocation for memory
> bandwidth the new resctrl files will display:
> info/MB/resource_schemata/MB/type:scalar linear
> info/MB/resource_schemata/MB/unit:all
> info/MB/resource_schemata/MB/scale:1
> info/MB/resource_schemata/MB/resolution:100
> info/MB/resource_schemata/MB/tolerance:0
> info/MB/resource_schemata/MB/max:100
> info/MB/resource_schemata/MB/min:10
>
>
> On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
> info/MB/resource_schemata/MB/type:scalar linear
> info/MB/resource_schemata/MB/unit:GBps
> info/MB/resource_schemata/MB/scale:1
> info/MB/resource_schemata/MB/resolution:8
> info/MB/resource_schemata/MB/tolerance:0
> info/MB/resource_schemata/MB/max:2048
> info/MB/resource_schemata/MB/min:1
>
> Having such interface will be helpful today. Users do not need to first figure out
> whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
> before interacting with resctrl. resctrl will be the generic interface it intends to be.
Yes. That is a good point.
> Part 2: Supporting multiple controls for a single resource
> This is a new feature on which there also appears to be consensus. It is needed by MPAM and
> Intel RDT where it is possible to use different controls for the same resource. For example,
> there can be a minimum and maximum control associated with the memory bandwidth resource.
>
> For example,
> info/
> └─ MB/
> └─ resource_schemata/
> ├─ MB/
> ├─ MB_MIN/
> ├─ MB_MAX/
> ┆
>
>
> Here is where the big question comes in for GLBE - is this actually a new resource
> for which resctrl needs to add interfaces to manage its allocation, or is it instead
> an additional control associated with the existing memory bandwidth resource?
It is not a new resource. It is a new control mechanism to address a
limitation of the memory bandwidth resource.
So, it is a new control for the existing memory bandwidth resource.
> For me things are actually pointing to GLBE not being a new resource but instead being
> a new control for the existing memory bandwidth resource.
>
> I understand that for a PoC it is simplest to add support for GLBE as a new resource as is
> done in this series, but treating it as an actual unique resource does not seem
> appropriate since resctrl already has a "memory bandwidth" resource. User space expects
> to find all the resources that it can allocate in info/ - I do not think it is correct
> to have two separate directories/resources for memory bandwidth here.
>
> What if, instead, it looks something like:
>
> info/
> └── MB/
> └── resource_schemata/
> ├── GMB/
> │ ├──max:4096
> │ ├──min:1
> │ ├──resolution:1
> │ ├──scale:1
> │ ├──tolerance:0
> │ ├──type:scalar linear
> │ └──unit:GBps
> └── MB/
> ├──max:8192
> ├──min:1
> ├──resolution:8
> ├──scale:1
> ├──tolerance:0
> ├──type:scalar linear
> └──unit:GBps
Yes. It definitely looks very clean.
> With an interface like above GMB is just another control/schema used to allocate the
> existing memory bandwidth resource. With the planned files it is possible to express the
> different maximums and units used by the MB and GMB schema. Users no longer need to
> dig for the unit information in the docs, it is available in the interface.
Yes. That is reasonable.
Is the plan to just update the resource information in
/sys/fs/resctrl/info/<resource_name>?
Also, will the display of /sys/fs/resctrl/schemata change?
Current display:
GMB:0=4096;1=4096;2=4096;3=4096
MB:0=8192;1=8192;2=8192;3=8192
> Doing something like this does depend on GLBE supporting the same number of CLOSIDs
> as MB, which seems to be how this will be implemented. If there is indeed a confirmation
> of this from AMD architecture then we can do something like this in resctrl.
I don't see this being an issue. I will get consensus on it.
I am wondering about the time frame and who is leading this change. Not
sure if that has been discussed already.
I can definitely help.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-12 19:09 ` Babu Moger
@ 2026-02-13 0:05 ` Reinette Chatre
2026-02-13 1:51 ` Moger, Babu
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-13 0:05 UTC (permalink / raw)
To: Babu Moger, Moger, Babu, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 2/12/26 11:09 AM, Babu Moger wrote:
> Hi Reinette,
>
> On 2/11/26 21:51, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 2/11/26 1:18 PM, Babu Moger wrote:
>>> On 2/11/26 10:54, Reinette Chatre wrote:
>>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
...
>>>> Another question, when setting aside possible differences between MB and GMB.
>>>>
>>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>>
>>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>>> same:
>>>>
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>
>>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>>> MB limit:
>>>> # echo "GMB:0=8;2=8" > schemata
>>>> # cat schemata
>>>> GMB:0=8;1=2048;2=8;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>> Yes. That is correct. It will cap the MB setting to 8. Note that we are setting aside the unit differences to keep it simple.
>> Thank you for confirming.
>>
>>>> ... and then when user space resets GMB the MB can reset like ...
>>>>
>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>
>>>> if I understand correctly this will only apply if the MB limit was never set so
>>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>>
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>>>
>>>> # echo "GMB:0=8;2=8" > schemata
>>>> # cat schemata
>>>> GMB:0=8;1=2048;2=8;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>>>
>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>>>
>>>> What would be the most intuitive way for the user to interact with the interfaces?
>>> I see that you are trying to display the effective behaviors above.
>> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
>> what a reasonable expectation of resctrl would be during these interactions.
>>
>>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
>> settings may cause confusion?
>
> I mean in many cases, we cannot determine the effective settings correctly. It depends on benchmarks or applications running on the system.
>
> Even with MB (without GMB support), even though we set the limit to 10GB, it may not use the whole 10GB. Memory is a shared resource. So, the effective bandwidth usage depends on other applications running on the system.
Sounds like we interpret "effective limits" differently. To me the limits(*) are deterministic.
If I understand correctly, if the GMB limit for domains A and B is set to x GB then that places
an x GB limit on MB for domains A and B also. Displaying any MB limit in the schemata that is
larger than x GB for domain A or domain B would be inaccurate, no?
When considering your example where the MB limit is 10GB.
Consider an example where there are two domains in this example with a configuration like below.
(I am using a different syntax from the schemata file that will hopefully make it easier to exchange
ideas without having to interpret the different GMB and MB units):
MB:0=10GB;1=10GB
If user space can create a GMB domain that limits shared bandwidth to 10GB that can be displayed
as below and will be accurate:
MB:0=10GB;1=10GB
GMB:0=10GB;1=10GB
If user space then reduces the combined bandwidth to 2GB then the MB limit is wrong since it
is actually capped by the GMB limit:
MB:0=10GB;1=10GB <==== Does not reflect the possible per-domain memory bandwidth, which is now capped by GMB
GMB:0=2GB;1=2GB
Would something like below not be more accurate, reflecting that the maximum average bandwidth
each domain could achieve is 2GB?
MB:0=2GB;1=2GB <==== Reflects accurate possible per-domain memory bandwidth
GMB:0=2GB;1=2GB
(*) As a side-note we may have to start being careful with how we use "limits" because of the planned
introduction of a "MAX" as a bandwidth control that is an actual limit as opposed to the
current control that is approximate.
>>> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
>> Yes, this will require resctrl to maintain more state.
>>
>> Documenting behavior is an option but I think we should first consider if there are things
>> resctrl can do to make the interface intuitive to use.
>>
>>>>>>> From the description it sounds as though there is a new "memory bandwidth
>>>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>>>> GMBA allocations while the proposed user interface presents them as independent.
>>>>>>
>>>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>>>> I hope this clarifies your question.
>>>> No. When enumerating the features the number of CLOSID supported by each is
>>>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>>>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
>>> No. There is no such scenario.
>>>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>>>> scenarios where some resource groups can support global AND per-domain limits while other
>>>> resource groups can just support global or just support per-domain limits. Is this correct?
>>> The system can support up to 16 CLOSIDs. All of them support all the features: LLC, MB, GMB, SMBA. Yes, we have separate enumeration for each feature. Are you suggesting we change it?
>> It is not a concern to have different CLOSIDs between resources that are actually different,
>> for example, having LLC or MB support different number of CLOSIDs. Having the possibility to
>> allocate the *same* resource (memory bandwidth) with varying number of CLOSIDs does present a
>> challenge though. Would it be possible to have a snippet in the spec that explicitly states
>> that MB and GMB will always enumerate with the same number of CLOSIDs?
>
> I have confirmed that is always the case. For all current and planned implementations, MB and GMB will have the same number of CLOSIDs.
Thank you very much for confirming. Is this something the architects would be willing to
commit to with a snippet in the PQoS spec?
>> Please see below where I will try to support this request more clearly and you can decide if
>> it is reasonable.
>>
>>>>>> can be seen as a single "resource" that can be allocated differently based on
>>>>>> the various schemata associated with that resource. This currently has a
>>>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>>>> may be something that we can reconsider?
>>>>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
>>>> The new approach is not final so please provide feedback to help improve it so
>>>> that the features you are enabling can be supported well.
>>> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits certain architectures.
>> It benefits all architectures.
>>
>> There are two parts to the current proposals.
>>
>> Part 1: Generic schema description
>> I believe there is consensus on this approach. This is actually something that is long
>> overdue and something like this would have been great to have with the initial AMD
>> enabling. With the generic schema description forming part of resctrl the user can learn
>> from resctrl how to interact with the schemata file instead of relying on external information
>> and documentation.
>
> ok.
>
>> For example, on an Intel system that uses percentage based proportional allocation for memory
>> bandwidth the new resctrl files will display:
>> info/MB/resource_schemata/MB/type:scalar linear
>> info/MB/resource_schemata/MB/unit:all
>> info/MB/resource_schemata/MB/scale:1
>> info/MB/resource_schemata/MB/resolution:100
>> info/MB/resource_schemata/MB/tolerance:0
>> info/MB/resource_schemata/MB/max:100
>> info/MB/resource_schemata/MB/min:10
>>
>>
>> On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
>> info/MB/resource_schemata/MB/type:scalar linear
>> info/MB/resource_schemata/MB/unit:GBps
>> info/MB/resource_schemata/MB/scale:1
>> info/MB/resource_schemata/MB/resolution:8
>> info/MB/resource_schemata/MB/tolerance:0
>> info/MB/resource_schemata/MB/max:2048
>> info/MB/resource_schemata/MB/min:1
>>
>> Having such an interface will be helpful today. Users do not need to first figure out
>> whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
>> before interacting with resctrl. resctrl will be the generic interface it intends to be.
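To illustrate how user space might consume these proposed files, here is a sketch. The paths, property names, and the interpretation of "resolution" as steps per unit all follow the proposal above and are assumptions, not a shipped interface:

```python
import os

def read_schema_props(schema_dir):
    """Read the proposed per-schema property files into a dict,
    e.g. schema_dir = 'info/MB/resource_schemata/MB'."""
    props = {}
    for name in ("type", "unit", "scale", "resolution", "tolerance", "max", "min"):
        with open(os.path.join(schema_dir, name)) as f:
            props[name] = f.read().strip()
    return props

def snap_request(value, props):
    """Clamp a requested value to [min, max] and round it to the nearest
    representable step, treating 'resolution' as steps per unit (an
    assumed semantic for 'scalar linear' schemas)."""
    step = 1.0 / int(props["resolution"])
    lo, hi = float(props["min"]), float(props["max"])
    value = max(lo, min(hi, value))
    return round(value / step) * step
```

With the AMD values above (min 1, max 2048, resolution 8) a request of 10.3 GBps would snap to 10.25 GBps, the nearest 1/8 GBps step.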
>
> Yes. That is a good point.
>
>> Part 2: Supporting multiple controls for a single resource
>> This is a new feature on which there also appears to be consensus. It is needed by MPAM and
>> Intel RDT where it is possible to use different controls for the same resource. For example,
>> there can be a minimum and maximum control associated with the memory bandwidth resource.
>>
>> For example,
>> info/
>> └─ MB/
>> └─ resource_schemata/
>> ├─ MB/
>> ├─ MB_MIN/
>> ├─ MB_MAX/
>> ┆
>>
>>
>> Here is where the big question comes in for GLBE - is this actually a new resource
>> for which resctrl needs to add interfaces to manage its allocation, or is it instead
>> an additional control associated with the existing memory bandwidth resource?
>
> It is not a new resource. It is a new control mechanism to address a limitation of the memory bandwidth resource.
>
> So, it is a new control for the existing memory bandwidth resource.
Thank you for confirming.
>
>> For me things are actually pointing to GLBE not being a new resource but instead being
>> a new control for the existing memory bandwidth resource.
>>
>> I understand that for a PoC it is simplest to add support for GLBE as a new resource as is
>> done in this series, but treating it as an actual unique resource does not seem
>> appropriate since resctrl already has a "memory bandwidth" resource. User space expects
>> to find all the resources that it can allocate in info/ - I do not think it is correct
>> to have two separate directories/resources for memory bandwidth here.
>>
>> What if, instead, it looks something like:
>>
>> info/
>> └── MB/
>> └── resource_schemata/
>> ├── GMB/
>> │ ├──max:4096
>> │ ├──min:1
>> │ ├──resolution:1
>> │ ├──scale:1
>> │ ├──tolerance:0
>> │ ├──type:scalar linear
>> │ └──unit:GBps
>> └── MB/
>> ├──max:8192
>> ├──min:1
>> ├──resolution:8
>> ├──scale:1
>> ├──tolerance:0
>> ├──type:scalar linear
>> └──unit:GBps
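With a layout like the above, user space could discover every control available for a resource by listing its resource_schemata directory. A sketch, assuming the proposed directory structure (the layout is not a shipped interface):

```python
import os

def list_controls(info_dir="/sys/fs/resctrl/info"):
    """Map each resource to the control/schema names found under the
    proposed resource_schemata sub-directory, e.g. {'MB': ['GMB', 'MB']}."""
    controls = {}
    for res in sorted(os.listdir(info_dir)):
        sub = os.path.join(info_dir, res, "resource_schemata")
        if os.path.isdir(sub):
            controls[res] = sorted(os.listdir(sub))
    return controls
```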
>
> Yes. It definitely looks very clean.
>
>> With an interface like above GMB is just another control/schema used to allocate the
>> existing memory bandwidth resource. With the planned files it is possible to express the
>> different maximums and units used by the MB and GMB schema. Users no longer need to
>> dig for the unit information in the docs, it is available in the interface.
>
>
> Yes. That is reasonable.
>
> Is the plan to just update the resource information in /sys/fs/resctrl/info/<resource_name> ?
I do not see any resource information that needs to change. As you confirmed,
MB and GMB have the same number of CLOSIDs and looking at the rest of the
enumeration done in patch #2 all other properties exposed in the top level of
/sys/fs/resctrl/info/MB are the same for MB and GMB. Specifically,
thread_throttle_mode, delay_linear, min_bandwidth, and bandwidth_gran have
the same values for MB and GMB. All other content in
/sys/fs/resctrl/info/MB would be new as part of the new "resource_schemata"
sub-directory.
Even so, I believe we could expect that a user using any new schemata file entry
introduced after the "resource_schemata" directory is introduced is aware of how
the properties are exposed and will not use the top level files in /sys/fs/resctrl/info/MB
(for example min_bandwidth and bandwidth_gran) to understand how to interact with
the new schema.
>
> Also, will the display of /sys/fs/resctrl/schemata change ?
There are no plans to change any of the existing schemata file entries.
>
> Current display:
When viewing "current" as what this series does in the schemata file ...
>
> GMB:0=4096;1=4096;2=4096;3=4096
> MB:0=8192;1=8192;2=8192;3=8192
yes, the schemata file should look like this on boot when all is done. All other
user facing changes are to the info/ directory where user space learns about
the new control for the resource and how to interact with the control.
>> Doing something like this does depend on GLBE supporting the same number of CLOSIDs
>> as MB, which seems to be how this will be implemented. If there is indeed a confirmation
>> of this from AMD architecture then we can do something like this in resctrl.
>
> I don't see this being an issue. I will get consensus on it.
>
> I am wondering about the time frame and who is leading this change. Not sure if that has been discussed already.
> I can definitely help.
A couple of features depend on the new schema descriptions as well as support for multiple
controls: min/max bandwidth controls on the MPAM side, region aware MBA and MBM on the Intel
side, and GLBE on the AMD side. I am hoping that the folks working on these features can
collaborate on the needed foundation. Since there are no patches for this yet I cannot say
if there is a leader for this work yet; at this time the role appears to be available if you
would like to see this move forward in order to meet your goals.
Reinette
>
> Thanks
>
> Babu
>
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-13 0:05 ` Reinette Chatre
@ 2026-02-13 1:51 ` Moger, Babu
2026-02-13 16:17 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Moger, Babu @ 2026-02-13 1:51 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/12/2026 6:05 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/12/26 11:09 AM, Babu Moger wrote:
>> Hi Reinette,
>>
>> On 2/11/26 21:51, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 2/11/26 1:18 PM, Babu Moger wrote:
>>>> On 2/11/26 10:54, Reinette Chatre wrote:
>>>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>
> ...
>
>>>>> Another question, when setting aside possible differences between MB and GMB.
>>>>>
>>>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>>>
>>>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>>>> same:
>>>>>
>>>>> # cat schemata
>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>
>>>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>>>> MB limit:
>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>> # cat schemata
>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>> MB:0=8;1=2048;2=8;3=2048
>>>> Yes. That is correct. It will cap the MB setting to 8. Note that we are talking about unit differences to make it simple.
>>> Thank you for confirming.
>>>
>>>>> ... and then when user space resets GMB the MB can reset like ...
>>>>>
>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>> # cat schemata
>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>
>>>>> if I understand correctly this will only apply if the MB limit was never set so
>>>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>>>
>>>>> # cat schemata
>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>
>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>> # cat schemata
>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>
>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>> # cat schemata
>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>
>>>>> What would be most intuitive way for user to interact with the interfaces?
>>>> I see that you are trying to display the effective behaviors above.
>>> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
>>> what would be a reasonable expectation from resctrl be during these interactions.
>>>
>>>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>>> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
>>> settings may cause confusion?
>>
>> I mean in many cases, we cannot determine the effective settings correctly. It depends on benchmarks or applications running on the system.
>>
>> Even with MB (without GMB support), even though we set the limit to 10GB, it may not use the whole 10GB. Memory is a shared resource, so the effective bandwidth usage depends on other applications running on the system.
>
> Sounds like we interpret "effective limits" differently. To me the limits(*) are deterministic.
> If I understand correctly, if the GMB limit for domains A and B is set to x GB then that places
> an x GB limit on MB for domains A and B also. Displaying any MB limit in the schemata that is
> larger than x GB for domain A or domain B would be inaccurate, no?
Yeah. But I was thinking not to mess with the values written to the registers.
>
> When considering your example where the MB limit is 10GB.
>
> Consider an example where there are two domains with a configuration like below.
> (I am using a different syntax from schemata file that will hopefully make it easier to exchange
> ideas when not having to interpret the different GMB and MB units):
>
> MB:0=10GB;1=10GB
>
> If user space can create a GMB domain that limits shared bandwidth to 10GB that can be displayed
> as below and will be accurate:
>
> MB:0=10GB;1=10GB
> GMB:0=10GB;1=10GB
>
> If user space then reduces the combined bandwidth to 2GB then the MB limit is wrong since it
> is actually capped by the GMB limit:
>
> MB:0=10GB;1=10GB <==== Does not reflect the possible per-domain memory bandwidth, which is now capped by GMB
> GMB:0=2GB;1=2GB
>
> Would something like below not be more accurate, reflecting that the maximum average bandwidth
> each domain could achieve is 2GB?
>
> MB:0=2GB;1=2GB <==== Reflects accurate possible per-domain memory bandwidth
> GMB:0=2GB;1=2GB
That is reasonable. Will check how we can accommodate that.
>
> (*) As a side-note we may have to start being careful with how we use "limits" because of the planned
> introduction of a "MAX" as a bandwidth control that is an actual limit as opposed to the
> current control that is approximate.
>
>>>> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
>>> Yes, this will require resctrl to maintain more state.
>>>
>>> Documenting behavior is an option but I think we should first consider if there are things
>>> resctrl can do to make the interface intuitive to use.
>>>
>>>>>>>> From the description it sounds as though there is a new "memory bandwidth
>>>>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>>>>> GMBA allocations while the proposed user interface presents them as independent.
>>>>>>>
>>>>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>>>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>>>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>>>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>>>>> I hope this clarifies your question.
>>>>> No. When enumerating the features the number of CLOSID supported by each is
>>>>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>>>>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
>>>> No. There is no such scenario.
>>>>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>>>>> scenarios where some resource groups can support global AND per-domain limits while other
>>>>> resource groups can just support global or just support per-domain limits. Is this correct?
>>>> The system can support up to 16 CLOSIDs. All of them support all the features: LLC, MB, GMB, SMBA. Yes, we have separate enumeration for each feature. Are you suggesting we change it?
>>> It is not a concern to have different CLOSIDs between resources that are actually different,
>>> for example, having LLC or MB support different number of CLOSIDs. Having the possibility to
>>> allocate the *same* resource (memory bandwidth) with varying number of CLOSIDs does present a
>>> challenge though. Would it be possible to have a snippet in the spec that explicitly states
>>> that MB and GMB will always enumerate with the same number of CLOSIDs?
>>
>> I have confirmed that is always the case. For all current and planned implementations, MB and GMB will have the same number of CLOSIDs.
>
> Thank you very much for confirming. Is this something the architects would be willing to
> commit to with a snippet in the PQoS spec?
I checked on that. Here is the response.
"I do not plan to add a statement like that to the spec. The CPUID
enumeration allows for them to have different number of CLOS's supported
for each. However, it is true that for all current and planned
implementations, MB and GMB will have the same number of CLOS."
>
>>> Please see below where I will try to support this request more clearly and you can decide if
>>> it is reasonable.
>>>
>>>>>>> can be seen as a single "resource" that can be allocated differently based on
>>>>>>> the various schemata associated with that resource. This currently has a
>>>>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>>>>> may be something that we can reconsider?
>>>>>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
>>>>> The new approach is not final so please provide feedback to help improve it so
>>>>> that the features you are enabling can be supported well.
>>>> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits certain architectures.
>>> It benefits all architectures.
>>>
>>> There are two parts to the current proposals.
>>>
>>> Part 1: Generic schema description
>>> I believe there is consensus on this approach. This is actually something that is long
>>> overdue and something like this would have been great to have with the initial AMD
>>> enabling. With the generic schema description forming part of resctrl the user can learn
>>> from resctrl how to interact with the schemata file instead of relying on external information
>>> and documentation.
>>
>> ok.
>>
>>> For example, on an Intel system that uses percentage based proportional allocation for memory
>>> bandwidth the new resctrl files will display:
>>> info/MB/resource_schemata/MB/type:scalar linear
>>> info/MB/resource_schemata/MB/unit:all
>>> info/MB/resource_schemata/MB/scale:1
>>> info/MB/resource_schemata/MB/resolution:100
>>> info/MB/resource_schemata/MB/tolerance:0
>>> info/MB/resource_schemata/MB/max:100
>>> info/MB/resource_schemata/MB/min:10
>>>
>>>
>>> On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
>>> info/MB/resource_schemata/MB/type:scalar linear
>>> info/MB/resource_schemata/MB/unit:GBps
>>> info/MB/resource_schemata/MB/scale:1
>>> info/MB/resource_schemata/MB/resolution:8
>>> info/MB/resource_schemata/MB/tolerance:0
>>> info/MB/resource_schemata/MB/max:2048
>>> info/MB/resource_schemata/MB/min:1
>>>
>>> Having such an interface will be helpful today. Users do not need to first figure out
>>> whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
>>> before interacting with resctrl. resctrl will be the generic interface it intends to be.
>>
>> Yes. That is a good point.
>>
>>> Part 2: Supporting multiple controls for a single resource
>>> This is a new feature on which there also appears to be consensus. It is needed by MPAM and
>>> Intel RDT where it is possible to use different controls for the same resource. For example,
>>> there can be a minimum and maximum control associated with the memory bandwidth resource.
>>>
>>> For example,
>>> info/
>>> └─ MB/
>>> └─ resource_schemata/
>>> ├─ MB/
>>> ├─ MB_MIN/
>>> ├─ MB_MAX/
>>> ┆
>>>
>>>
>>> Here is where the big question comes in for GLBE - is this actually a new resource
>>> for which resctrl needs to add interfaces to manage its allocation, or is it instead
>>> an additional control associated with the existing memory bandwidth resource?
>>
>> It is not a new resource. It is a new control mechanism to address a limitation of the memory bandwidth resource.
>>
>> So, it is a new control for the existing memory bandwidth resource.
>
> Thank you for confirming.
>
>>
>>> For me things are actually pointing to GLBE not being a new resource but instead being
>>> a new control for the existing memory bandwidth resource.
>>>
>>> I understand that for a PoC it is simplest to add support for GLBE as a new resource as is
>>> done in this series, but treating it as an actual unique resource does not seem
>>> appropriate since resctrl already has a "memory bandwidth" resource. User space expects
>>> to find all the resources that it can allocate in info/ - I do not think it is correct
>>> to have two separate directories/resources for memory bandwidth here.
>>>
>>> What if, instead, it looks something like:
>>>
>>> info/
>>> └── MB/
>>> └── resource_schemata/
>>> ├── GMB/
>>> │ ├──max:4096
>>> │ ├──min:1
>>> │ ├──resolution:1
>>> │ ├──scale:1
>>> │ ├──tolerance:0
>>> │ ├──type:scalar linear
>>> │ └──unit:GBps
>>> └── MB/
>>> ├──max:8192
>>> ├──min:1
>>> ├──resolution:8
>>> ├──scale:1
>>> ├──tolerance:0
>>> ├──type:scalar linear
>>> └──unit:GBps
>>
>> Yes. It definitely looks very clean.
>>
>>> With an interface like above GMB is just another control/schema used to allocate the
>>> existing memory bandwidth resource. With the planned files it is possible to express the
>>> different maximums and units used by the MB and GMB schema. Users no longer need to
>>> dig for the unit information in the docs, it is available in the interface.
>>
>>
>> Yes. That is reasonable.
>>
>> Is the plan to just update the resource information in /sys/fs/resctrl/info/<resource_name> ?
>
> I do not see any resource information that needs to change. As you confirmed,
> MB and GMB have the same number of CLOSIDs and looking at the rest of the
> enumeration done in patch #2 all other properties exposed in the top level of
> /sys/fs/resctrl/info/MB are the same for MB and GMB. Specifically,
> thread_throttle_mode, delay_linear, min_bandwidth, and bandwidth_gran have
> the same values for MB and GMB. All other content in
> /sys/fs/resctrl/info/MB would be new as part of the new "resource_schemata"
> sub-directory.
>
> Even so, I believe we could expect that a user using any new schemata file entry
> introduced after the "resource_schemata" directory is introduced is aware of how
> the properties are exposed and will not use the top level files in /sys/fs/resctrl/info/MB
> (for example min_bandwidth and bandwidth_gran) to understand how to interact with
> the new schema.
>
>
>>
>> Also, will the display of /sys/fs/resctrl/schemata change ?
>
> There are no plans to change any of the existing schemata file entries.
>
>>
>> Current display:
>
> When viewing "current" as what this series does in the schemata file ...
>
>>
>> GMB:0=4096;1=4096;2=4096;3=4096
>> MB:0=8192;1=8192;2=8192;3=8192
>
> yes, the schemata file should look like this on boot when all is done. All other
> user facing changes are to the info/ directory where user space learns about
> the new control for the resource and how to interact with the control.
>
>>> Doing something like this does depend on GLBE supporting the same number of CLOSIDs
>>> as MB, which seems to be how this will be implemented. If there is indeed a confirmation
>>> of this from AMD architecture then we can do something like this in resctrl.
>>
>> I don't see this being an issue. I will get consensus on it.
>>
>> I am wondering about the time frame and who is leading this change. Not sure if that has been discussed already.
>> I can definitely help.
>
> A couple of features depend on the new schema descriptions as well as support for multiple
> controls: min/max bandwidth controls on the MPAM side, region aware MBA and MBM on the Intel
> side, and GLBE on the AMD side. I am hoping that the folks working on these features can
> collaborate on the needed foundation. Since there are no patches for this yet I cannot say
> if there is a leader for this work yet; at this time the role appears to be available if you
> would like to see this move forward in order to meet your goals.
I joined this feature effort a bit later, so I may not yet have full
context on the MPAM and region‑aware requirements. I’m happy to provide
all the necessary information for GMB and MB from the AMD side, and I’m
also available to help with reviews and testing.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure
2026-02-11 15:19 ` Ben Horgan
2026-02-11 16:54 ` Reinette Chatre
@ 2026-02-13 15:50 ` Moger, Babu
1 sibling, 0 replies; 114+ messages in thread
From: Moger, Babu @ 2026-02-13 15:50 UTC (permalink / raw)
To: Ben Horgan, Babu Moger, corbet, tony.luck, reinette.chatre,
Dave.Martin, james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Ben,
On 2/11/2026 9:19 AM, Ben Horgan wrote:
> Hi Babu,
>
> On 1/21/26 21:12, Babu Moger wrote:
>> Add plza_capable field to the rdt_resource structure to indicate whether
>> Privilege Level Zero Association (PLZA) is supported for that resource
>> type.
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++++
>> include/linux/resctrl.h | 3 +++
>> 3 files changed, 14 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 2de3140dd6d1..e41fe5fa3f30 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -295,6 +295,9 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
>>
>> r->alloc_capable = true;
>>
>> + if (rdt_cpu_has(X86_FEATURE_PLZA))
>> + r->plza_capable = true;
>> +
>> return true;
>> }
>>
>> @@ -314,6 +317,9 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
>> if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
>> r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
>> r->alloc_capable = true;
>> +
>> + if (rdt_cpu_has(X86_FEATURE_PLZA))
>> + r->plza_capable = true;
>> }
>>
>> static void rdt_get_cdp_config(int level)
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 885026468440..540e1e719d7f 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -229,6 +229,11 @@ bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>> return rdt_resources_all[l].cdp_enabled;
>> }
>>
>> +bool resctrl_arch_get_plza_capable(enum resctrl_res_level l)
>> +{
>> + return rdt_resources_all[l].r_resctrl.plza_capable;
>> +}
>> +
>> void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
>> {
>> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index 63d74c0dbb8f..ae252a0e6d92 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -319,6 +319,7 @@ struct resctrl_mon {
>> * @name: Name to use in "schemata" file.
>> * @schema_fmt: Which format string and parser is used for this schema.
>> * @cdp_capable: Is the CDP feature available on this resource
>> + * @plza_capable: Is Privilege Level Zero Association capable?
>> */
>> struct rdt_resource {
>> int rid;
>> @@ -334,6 +335,7 @@ struct rdt_resource {
>> char *name;
>> enum resctrl_schema_fmt schema_fmt;
>> bool cdp_capable;
>> + bool plza_capable;
>
> Why are you making plza a resource property? Certainly for MPAM we'd
> want this to be global across resources and I see above that you are
> just checking a CPU property rather than anything per resource.
Yes, I agree. This does not have to be a resource property.
Will make it a global property that can be set from the arch code.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-13 1:51 ` Moger, Babu
@ 2026-02-13 16:17 ` Reinette Chatre
2026-02-13 23:14 ` Moger, Babu
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-13 16:17 UTC (permalink / raw)
To: Moger, Babu, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 2/12/26 5:51 PM, Moger, Babu wrote:
> On 2/12/2026 6:05 PM, Reinette Chatre wrote:
>> On 2/12/26 11:09 AM, Babu Moger wrote:
>>> On 2/11/26 21:51, Reinette Chatre wrote:
>>>> On 2/11/26 1:18 PM, Babu Moger wrote:
>>>>> On 2/11/26 10:54, Reinette Chatre wrote:
>>>>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>
>> ...
>>
>>>>>> Another question, when setting aside possible differences between MB and GMB.
>>>>>>
>>>>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>>>>
>>>>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>>>>> same:
>>>>>>
>>>>>> # cat schemata
>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>
>>>>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>>>>> MB limit:
>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>> # cat schemata
>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>> Yes. That is correct. It will cap the MB setting to 8. Note that we are ignoring unit differences to keep it simple.
>>>> Thank you for confirming.
>>>>
>>>>>> ... and then when user space resets GMB the MB can reset like ...
>>>>>>
>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>> # cat schemata
>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>
>>>>>> if I understand correctly this will only apply if the MB limit was never set so
>>>>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>>>>
>>>>>> # cat schemata
>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>
>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>> # cat schemata
>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>
>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>> # cat schemata
>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>
>>>>>> What would be most intuitive way for user to interact with the interfaces?
>>>>> I see that you are trying to display the effective behaviors above.
>>>> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
>>>> what a reasonable expectation from resctrl would be during these interactions.
>>>>
>>>>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>>>> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
>>>> settings may cause confusion?
>>>
>>> I mean in many cases, we cannot determine the effective settings correctly. It depends on benchmarks or applications running on the system.
>>>
>>> Even with MB (without GMB support), even though we set the limit to 10GB, it may not use the whole 10GB. Memory is a shared resource, so the effective bandwidth usage depends on other applications running on the system.
>>
>> Sounds like we interpret "effective limits" differently. To me the limits(*) are deterministic.
>> If I understand correctly, if the GMB limit for domains A and B is set to x GB then that places
>> an x GB limit on MB for domains A and B also. Displaying any MB limit in the schemata that is
>> larger than x GB for domain A or domain B would be inaccurate, no?
>
> Yea. But I was thinking not to mess with the values written to the registers.
This is not about what is written to the registers but how the combined values
written to registers control system behavior and how to accurately reflect the
resulting system behavior to user space.
>> When considering your example where the MB limit is 10GB.
>>
>> Consider an example where there are two domains in this example with a configuration like below.
>> (I am using a different syntax from schemata file that will hopefully make it easier to exchange
>> ideas when not having to interpret the different GMB and MB units):
>>
>> MB:0=10GB;1=10GB
>>
>> If user space can create a GMB domain that limits shared bandwidth to 10GB that can be displayed
>> as below and will be accurate:
>>
>> MB:0=10GB;1=10GB
>> GMB:0=10GB;1=10GB
>>
>> If user space then reduces the combined bandwidth to 2GB then the MB limit is wrong since it
>> is actually capped by the GMB limit:
>>
>> MB:0=10GB;1=10GB <==== Does not reflect the possible per-domain memory bandwidth, which is now capped by GMB
>> GMB:0=2GB;1=2GB
>>
>> Would something like below not be more accurate, reflecting that the maximum average bandwidth
>> each domain could achieve is 2GB?
>>
>> MB:0=2GB;1=2GB <==== Reflects accurate possible per-domain memory bandwidth
>> GMB:0=2GB;1=2GB
>
> That is reasonable. Will check how we can accommodate that.
Right, this is not about the values in the L3BE registers but instead how those values
are impacted by GLBE registers and how to most accurately present the resulting system
configuration to user space. Thank you for considering.
>
>>
>> (*) As a side-note we may have to start being careful with how we use "limits" because of the planned
>> introduction of a "MAX" as a bandwidth control that is an actual limit as opposed to the
>> current control that is approximate.
>>
>>>>> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
>>>> Yes, this will require resctrl to maintain more state.
>>>>
>>>> Documenting behavior is an option but I think we should first consider if there are things
>>>> resctrl can do to make the interface intuitive to use.
>>>>
>>>>>>>>> From the description it sounds as though there is a new "memory bandwidth
>>>>>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>>>>>> GMBA allocations while the proposed user interface present them as independent.
>>>>>>>>
>>>>>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>>>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>>>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>>>>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>>>>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>>>>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>>>>>> I hope this clarifies your question.
>>>>>> No. When enumerating the features the number of CLOSID supported by each is
>>>>>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>>>>>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
>>>>> No. There is no such scenario.
>>>>>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>>>>>> scenarios where some resource groups can support global AND per-domain limits while other
>>>>>> resource groups can just support global or just support per-domain limits. Is this correct?
>>>>> The system can support up to 16 CLOSIDs. All of them support all the features: LLC, MB, GMB, SMBA. Yes, we have separate enumeration for each feature. Are you suggesting we change it?
>>>> It is not a concern to have different CLOSIDs between resources that are actually different,
>>>> for example, having LLC or MB support different number of CLOSIDs. Having the possibility to
>>>> allocate the *same* resource (memory bandwidth) with varying number of CLOSIDs does present a
>>>> challenge though. Would it be possible to have a snippet in the spec that explicitly states
>>>> that MB and GMB will always enumerate with the same number of CLOSIDs?
>>>
>>> I have confirmed that is always the case. In all current and planned implementations, MB and GMB will have the same number of CLOSIDs.
>>
>> Thank you very much for confirming. Is this something the architects would be willing to
>> commit to with a snippet in the PQoS spec?
>
> I checked on that. Here is the response.
>
> "I do not plan to add a statement like that to the spec. The CPUID enumeration allows for them to have different number of CLOS's supported for each. However, it is true that for all current and planned implementations, MB and GMB will have the same number of CLOS."
Thank you for asking. At this time the definition of a resource's "num_closids" is:
"num_closids":
The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
CLOSIDs of all enabled resources as limit.
Without a commitment from the architecture we could expand the definition of
"num_closids" when adding multiple controls to indicate that it is the smallest
number of CLOSIDs supported by all controls.
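To make that rule concrete, here is a hypothetical sketch (names invented for illustration, not actual resctrl code):

```python
# Hypothetical sketch, not resctrl code: a resource exposing multiple
# controls (e.g. MB and GMB) would report the smallest number of CLOSIDs
# supported by any of its controls.

def resource_num_closids(control_closids):
    """control_closids maps a control name to the CLOSIDs it enumerates."""
    return min(control_closids.values())

# Current and planned implementations enumerate the same count for MB and
# GMB, so nothing changes; a hypothetical mismatch is clamped downward.
print(resource_num_closids({"MB": 16, "GMB": 16}))  # 16
print(resource_num_closids({"MB": 16, "GMB": 8}))   # 8
```

This mirrors how resctrl already takes the smallest count across enabled resources, just applied one level down at the control granularity.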
>>>> Please see below where I will try to support this request more clearly and you can decide if
>>>> it is reasonable.
>>>>
>>>>>>>> can be seen as a single "resource" that can be allocated differently based on
>>>>>>>> the various schemata associated with that resource. This currently has a
>>>>>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>>>>>> may be something that we can reconsider?
>>>>>>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
>>>>>> The new approach is not final so please provide feedback to help improve it so
>>>>>> that the features you are enabling can be supported well.
>>>>> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits certain architectures.
>>>> It benefits all architectures.
>>>>
>>>> There are two parts to the current proposals.
>>>>
>>>> Part 1: Generic schema description
>>>> I believe there is consensus on this approach. This is actually something that is long
>>>> overdue and something like this would have been great to have with the initial AMD
>>>> enabling. With the generic schema description forming part of resctrl the user can learn
>>>> from resctrl how to interact with the schemata file instead of relying on external information
>>>> and documentation.
>>>
>>> ok.
>>>
>>>> For example, on an Intel system that uses percentage based proportional allocation for memory
>>>> bandwidth the new resctrl files will display:
>>>> info/MB/resource_schemata/MB/type:scalar linear
>>>> info/MB/resource_schemata/MB/unit:all
>>>> info/MB/resource_schemata/MB/scale:1
>>>> info/MB/resource_schemata/MB/resolution:100
>>>> info/MB/resource_schemata/MB/tolerance:0
>>>> info/MB/resource_schemata/MB/max:100
>>>> info/MB/resource_schemata/MB/min:10
>>>>
>>>>
>>>> On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
>>>> info/MB/resource_schemata/MB/type:scalar linear
>>>> info/MB/resource_schemata/MB/unit:GBps
>>>> info/MB/resource_schemata/MB/scale:1
>>>> info/MB/resource_schemata/MB/resolution:8
>>>> info/MB/resource_schemata/MB/tolerance:0
>>>> info/MB/resource_schemata/MB/max:2048
>>>> info/MB/resource_schemata/MB/min:1
>>>>
>>>> Having such interface will be helpful today. Users do not need to first figure out
>>>> whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
>>>> before interacting with resctrl. resctrl will be the generic interface it intends to be.
>>>
>>> Yes. That is a good point.
>>>
>>>> Part 2: Supporting multiple controls for a single resource
>>>> This is a new feature on which there also appears to be consensus that is needed by MPAM and
>>>> Intel RDT where it is possible to use different controls for the same resource. For example,
>>>> there can be a minimum and maximum control associated with the memory bandwidth resource.
>>>>
>>>> For example,
>>>> info/
>>>> └─ MB/
>>>> └─ resource_schemata/
>>>> ├─ MB/
>>>> ├─ MB_MIN/
>>>> ├─ MB_MAX/
>>>> ┆
>>>>
>>>>
>>>> Here is where the big question comes in for GLBE - is this actually a new resource
>>>> for which resctrl needs to add interfaces to manage its allocation, or is it instead
>>>> an additional control associated with the existing memory bandwidth resource?
>>>
>>> It is not a new resource. It is a new control mechanism that addresses a limitation of the memory bandwidth resource.
>>>
>>> So, it is a new control for the existing memory bandwidth resource.
>>
>> Thank you for confirming.
>>
>>>
>>>> For me things are actually pointing to GLBE not being a new resource but instead being
>>>> a new control for the existing memory bandwidth resource.
>>>>
>>>> I understand that for a PoC it is simplest to add support for GLBE as a new resource as is
>>>> done in this series, but treating it as an actual unique resource does not seem
>>>> appropriate since resctrl already has a "memory bandwidth" resource. User space expects
>>>> to find all the resources that it can allocate in info/ - I do not think it is correct
>>>> to have two separate directories/resources for memory bandwidth here.
>>>>
>>>> What if, instead, it looks something like:
>>>>
>>>> info/
>>>> └── MB/
>>>> └── resource_schemata/
>>>> ├── GMB/
>>>> │ ├──max:4096
>>>> │ ├──min:1
>>>> │ ├──resolution:1
>>>> │ ├──scale:1
>>>> │ ├──tolerance:0
>>>> │ ├──type:scalar linear
>>>> │ └──unit:GBps
>>>> └── MB/
>>>> ├──max:8192
>>>> ├──min:1
>>>> ├──resolution:8
>>>> ├──scale:1
>>>> ├──tolerance:0
>>>> ├──type:scalar linear
>>>> └──unit:GBps
>>>
>>> Yes. It definitely looks very clean.
>>>
>>>> With an interface like above GMB is just another control/schema used to allocate the
>>>> existing memory bandwidth resource. With the planned files it is possible to express the
>>>> different maximums and units used by the MB and GMB schema. Users no longer need to
>>>> dig for the unit information in the docs, it is available in the interface.
>>>
>>>
>>> Yes. That is reasonable.
>>>
>>> Is the plan to just update the resource information in /sys/fs/resctrl/info/<resource_name>?
>>
>> I do not see any resource information that needs to change. As you confirmed,
>> MB and GMB have the same number of CLOSIDs and looking at the rest of the
>> enumeration done in patch #2 all other properties exposed in top level of
>> /sys/fs/resctrl/info/MB is the same for MB and GMB. Specifically,
>> thread_throttle_mode, delay_linear, min_bandwidth, and bandwidth_gran have
>> the same values for MB and GMB. All other content in
>> /sys/fs/resctrl/info/MB would be new as part of the new "resource_schemata"
>> sub-directory.
>>
>> Even so, I believe we could expect that a user using any new schemata file entry
>> introduced after the "resource_schemata" directory is introduced is aware of how
>> the properties are exposed and will not use the top level files in /sys/fs/resctrl/info/MB
>> (for example min_bandwidth and bandwidth_gran) to understand how to interact with
>> the new schema.
>>
>>
>>>
>>> Also, will the display of /sys/fs/resctrl/schemata change ?
>>
>> There are no plans to change any of the existing schemata file entries.
>>
>>>
>>> Current display:
>>
>> When viewing "current" as what this series does in schemata file ...
>>
>>>
>>> GMB:0=4096;1=4096;2=4096;3=4096
>>> MB:0=8192;1=8192;2=8192;3=8192
>>
>> yes, the schemata file should look like this on boot when all is done. All other
>> user facing changes are to the info/ directory where user space learns about
>> the new control for the resource and how to interact with the control.
>>
>>>> Doing something like this does depend on GLBE supporting the same number of CLOSIDs
>>>> as MB, which seems to be how this will be implemented. If there is indeed a confirmation
>>>> of this from AMD architecture then we can do something like this in resctrl.
>>>
>>> I don't see this being an issue. I will get consensus on it.
>>>
>>> I am wondering about the time frame and who is leading this change. Not sure if that has been discussed already.
>>> I can definitely help.
>>
>> A couple of features depend on the new schema descriptions as well as support for multiple
>> controls: min/max bandwidth controls on the MPAM side, region aware MBA and MBM on the Intel
>> side, and GLBE on the AMD side. I am hoping that the folks working on these features can
>> collaborate on the needed foundation. Since there are no patches for this yet I cannot say
if there is a leader for this work yet; at this time the role appears to be available if you
>> would like to see this moving forward in order to meet your goals.
>
>
> I joined this feature effort a bit later, so I may not yet have full context on the MPAM and region‑aware requirements. I’m happy to provide all the necessary information for GMB and MB from the AMD side, and I’m also available to help with reviews and testing.
I understand there is a lot involved. With so many folks dependent on this work I anticipate
that any effort will get support from the various content experts. Your knowledge of resctrl
fs will be valuable in this effort.
Reinette
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-10 16:17 ` Reinette Chatre
2026-02-10 18:04 ` Reinette Chatre
@ 2026-02-13 16:37 ` Moger, Babu
2026-02-13 17:02 ` Luck, Tony
2026-02-14 0:10 ` Reinette Chatre
1 sibling, 2 replies; 114+ messages in thread
From: Moger, Babu @ 2026-02-13 16:37 UTC (permalink / raw)
To: Reinette Chatre, Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Reinette,
On 2/10/2026 10:17 AM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>
>>
>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>> Babu,
>>>>
>>>> I've read a bit more of the code now and I think I understand more.
>>>>
>>>> Some useful additions to your explanation.
>>>>
>>>> 1) Only one CTRL group can be marked as PLZA
>>>
>>> Yes. Correct.
>
> Why limit it to one CTRL_MON group and why not support it for MON groups?
There can be only one PLZA configuration in a system. The values in the
MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID, CLOSID_EN) must
be identical across all logical processors. The only field that may
differ is PLZA_EN.
I was initially unsure which RMID should be used when PLZA is enabled on
MON groups.
After re-evaluating, enabling PLZA on MON groups is still feasible:
1. Only one group in the system can have PLZA enabled.
2. If PLZA is enabled on a CTRL_MON group then we cannot enable PLZA on a
MON group.
3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and RMID of
the CTRL_MON group can be written.
4. If PLZA is enabled on a MON group, then the CLOSID of the CTRL_MON
group can be used, while the RMID of the MON group can be written.
I am thinking this approach should work.
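To make those constraints concrete, here is a toy model (hypothetical code, only meant to illustrate the rules above, not the actual implementation):

```python
# Toy model of the proposed PLZA constraints -- hypothetical, not the
# real resctrl code. A group is a dict with "kind" ("CTRL_MON" or
# "MON"), "closid", "rmid", and (for MON groups) "parent".

class PlzaState:
    def __init__(self):
        self.enabled_group = None  # at most one PLZA group system-wide

    def enable(self, group):
        # Rules 1/2: only one group in the whole system may have PLZA,
        # regardless of whether it is a CTRL_MON or MON group.
        if self.enabled_group is not None:
            raise ValueError("PLZA already enabled on another group")
        self.enabled_group = group

    def assoc(self):
        """(CLOSID, RMID) that would go into MSR_IA32_PQR_PLZA_ASSOC."""
        g = self.enabled_group
        if g["kind"] == "CTRL_MON":
            # Rule 3: the CTRL_MON group supplies both CLOSID and RMID.
            return g["closid"], g["rmid"]
        # Rule 4: a MON group supplies its own RMID but uses the parent
        # CTRL_MON group's CLOSID.
        return g["parent"]["closid"], g["rmid"]
```

For example, enabling PLZA on a MON group with RMID 9 under a CTRL_MON group with CLOSID 2 yields the pair (2, 9), and a second enable attempt anywhere in the system fails.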
>
> Limiting it to a single CTRL group seems restrictive in a few ways:
> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> number of use cases that can be supported. Consider, for example, an existing
> "high priority" resource group and a "low priority" resource group. The user may
> just want to let the tasks in the "low priority" resource group run as "high priority"
> when in CPL0. This of course may depend on what resources are allocated, for example
> cache may need more care, but if, for example, user is only interested in memory
> bandwidth allocation this seems a reasonable use case?
> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> capable of in terms of number of different control groups/CLOSID that can be
> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> example, create a resource group that contains tasks of interest and create
> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> This will give user space better insight into system behavior and from what I can
> tell is supported by the feature but not enabled?
Yes, as long as PLZA is enabled on only one group in the entire system.
>
>>>
>>>> 2) It can't be the root/default group
>>>
>>> This is something I added to keep the default group in an undisturbed state.
>
> Why was this needed?
>
With the new approach mentioned above we can enable it in the default group also.
>>>
>>>> 3) It can't have sub monitor groups
>
> Why not?
Ditto. With the new approach mentioned above we can enable it in the default
group also.
>
>>>> 4) It can't be pseudo-locked
>>>
>>> Yes.
>>>
>>>>
>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>> need to change.
>>>
>>> Yes. That can be one use case.
>>>
>>>>
>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>> do:
>>>>
>>>> # echo '*' > tasks
>
> Dedicating a resource group to "PLZA" seems restrictive while also adding many
> complications since this designation makes resource group behave differently and
> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>
> I am wondering if it will not be simpler to introduce just one new file, for example
> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> resource group to manage user space and kernel space allocations while also supporting
> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> use case where user space can create a new resource group with certain allocations but the
> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> the resource group's allocations when in CPL0.
Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
We need to make sure that only one group can be configured in the system and
not allow it in other groups when it is already enabled.
Thanks
Babu
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-13 16:37 ` Moger, Babu
@ 2026-02-13 17:02 ` Luck, Tony
2026-02-16 19:24 ` Babu Moger
2026-02-14 0:10 ` Reinette Chatre
1 sibling, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-02-13 17:02 UTC (permalink / raw)
To: Moger, Babu
Cc: Reinette Chatre, Moger, Babu, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
On Fri, Feb 13, 2026 at 10:37:48AM -0600, Moger, Babu wrote:
> Hi Reinette,
>
> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
> > Hi Babu,
> >
> > On 1/28/26 9:44 AM, Moger, Babu wrote:
> > >
> > >
> > > On 1/28/2026 11:41 AM, Moger, Babu wrote:
> > > > > On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> > > > > > On 1/27/2026 4:30 PM, Luck, Tony wrote:
> > > > > Babu,
> > > > >
> > > > > I've read a bit more of the code now and I think I understand more.
> > > > >
> > > > > Some useful additions to your explanation.
> > > > >
> > > > > 1) Only one CTRL group can be marked as PLZA
> > > >
> > > > Yes. Correct.
> >
> > Why limit it to one CTRL_MON group and why not support it for MON groups?
>
> There can be only one PLZA configuration in a system. The values in the
> MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID, CLOSID_EN) must be
> identical across all logical processors. The only field that may differ is
> PLZA_EN.
>
> I was initially unsure which RMID should be used when PLZA is enabled on MON
> groups.
>
> After re-evaluating, enabling PLZA on MON groups is still feasible:
>
> 1. Only one group in the system can have PLZA enabled.
> 2. If PLZA is enabled on a CTRL_MON group then we cannot enable PLZA on a MON
> group.
> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and RMID of the
> CTRL_MON group can be written.
> 4. If PLZA is enabled on a MON group, then the CLOSID of the CTRL_MON group
> can be used, while the RMID of the MON group can be written.
>
> I am thinking this approach should work.
I can see why a user might want to accumulate all kernel resource usage
in one RMID, separately from application resource usage. But wanting to
subdivide that between different tasks seems a stretch.
Remember that there are 3 main reasons why the kernel may be entered
while an application is running:
1) Application makes a system call
2) A trap or fault (most common = pagefault?)
3) An interrupt
The application has some limited control over 1 & 2. None at
all over 3.
So I'd like to hear some real use cases before resctrl commits
to adding this complexity.
>
> >
> > Limiting it to a single CTRL group seems restrictive in a few ways:
> > 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> > number of use cases that can be supported. Consider, for example, an existing
> > "high priority" resource group and a "low priority" resource group. The user may
> > just want to let the tasks in the "low priority" resource group run as "high priority"
> > when in CPL0. This of course may depend on what resources are allocated, for example
> > cache may need more care, but if, for example, user is only interested in memory
> > bandwidth allocation this seems a reasonable use case?
> > 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> > capable of in terms of number of different control groups/CLOSID that can be
> > assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> > 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> > MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> > example, create a resource group that contains tasks of interest and create
> > a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> > This will give user space better insight into system behavior and from what I can
> > tell is supported by the feature but not enabled?
>
>
> Yes, as long as PLZA is enabled on only one group in the entire system
>
> >
> > > >
> > > > > 2) It can't be the root/default group
> > > >
> > > > This is something I added to keep the default group in an undisturbed state.
> >
> > Why was this needed?
> >
>
> With the new approach mentioned above we can enable it in the default group also.
>
> > > >
> > > > > 3) It can't have sub monitor groups
> >
> > Why not?
>
> Ditto. With the new approach mentioned above we can enable it in the default
> group also.
>
> >
> > > > > 4) It can't be pseudo-locked
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > Would a potential use case involve putting *all* tasks into the PLZA group? That
> > > > > would avoid any additional context switch overhead as the PLZA MSR would never
> > > > > need to change.
> > > >
> > > > Yes. That can be one use case.
> > > >
> > > > >
> > > > > If that is the case, maybe for the PLZA group we should allow user to
> > > > > do:
> > > > >
> > > > > # echo '*' > tasks
> >
> > Dedicating a resource group to "PLZA" seems restrictive while also adding many
> > complications since this designation makes resource group behave differently and
> > thus the files need to get extra "treatments" to handle this "PLZA" designation.
> >
> > I am wondering if it will not be simpler to introduce just one new file, for example
> > "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> > file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> > task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> > resource group to manage user space and kernel space allocations while also supporting
> > various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> > use case where user space can create a new resource group with certain allocations but the
> > "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> > the resource group's allocations when in CPL0.
>
> Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
>
> We need to make sure that only one group can be configured in the system and
> not allow it in other groups when it is already enabled.
>
> Thanks
> Babu
>
> >
> > Reinette
> >
> > [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
> >
-Tony
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-13 16:17 ` Reinette Chatre
@ 2026-02-13 23:14 ` Moger, Babu
2026-02-14 0:01 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Moger, Babu @ 2026-02-13 23:14 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/13/2026 10:17 AM, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/12/26 5:51 PM, Moger, Babu wrote:
>> On 2/12/2026 6:05 PM, Reinette Chatre wrote:
>>> On 2/12/26 11:09 AM, Babu Moger wrote:
>>>> On 2/11/26 21:51, Reinette Chatre wrote:
>>>>> On 2/11/26 1:18 PM, Babu Moger wrote:
>>>>>> On 2/11/26 10:54, Reinette Chatre wrote:
>>>>>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>>>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>
>>> ...
>>>
>>>>>>> Another question, when setting aside possible differences between MB and GMB.
>>>>>>>
>>>>>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>>>>>
>>>>>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>>>>>> same:
>>>>>>>
>>>>>>> # cat schemata
>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>>
>>>>>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>>>>>> MB limit:
>>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>>> # cat schemata
>>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>> Yes. That is correct. It will cap the MB setting to 8. Note that we are ignoring unit differences to keep it simple.
>>>>> Thank you for confirming.
>>>>>
>>>>>>> ... and then when user space resets GMB the MB can reset like ...
>>>>>>>
>>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>>> # cat schemata
>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>>
>>>>>>> if I understand correctly this will only apply if the MB limit was never set so
>>>>>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>>>>>
>>>>>>> # cat schemata
>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>
>>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>>> # cat schemata
>>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>
>>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>>> # cat schemata
>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>
>>>>>>> What would be most intuitive way for user to interact with the interfaces?
>>>>>> I see that you are trying to display the effective behaviors above.
>>>>> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
>>>>> what a reasonable expectation of resctrl would be during these interactions.
>>>>>
>>>>>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>>>>> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
>>>>> settings may cause confusion?
>>>>
>>>> I mean in many cases, we cannot determine the effective settings correctly. It depends on benchmarks or applications running on the system.
>>>>
>>>> Even with MB (without GMB support), even though we set the limit to 10GB, it may not use the whole 10GB. Memory is a shared resource, so the effective bandwidth usage depends on other applications running on the system.
>>>
>>> Sounds like we interpret "effective limits" differently. To me the limits(*) are deterministic.
>>> If I understand correctly, if the GMB limit for domains A and B is set to x GB then that places
>>> an x GB limit on MB for domains A and B also. Displaying any MB limit in the schemata that is
>>> larger than x GB for domain A or domain B would be inaccurate, no?
>>
>> Yea. But, I was thinking not to mess with values written at registers.
>
> This is not about what is written to the registers but how the combined values
> written to registers control system behavior and how to accurately reflect the
> resulting system behavior to user space.
>
>>> When considering your example where the MB limit is 10GB.
>>>
>>> Consider an example where there are two domains in this example with a configuration like below.
>>> (I am using a different syntax from schemata file that will hopefully make it easier to exchange
>>> ideas when not having to interpret the different GMB and MB units):
>>>
>>> MB:0=10GB;1=10GB
>>>
>>> If user space can create a GMB domain that limits shared bandwidth to 10GB that can be displayed
>>> as below and will be accurate:
>>>
>>> MB:0=10GB;1=10GB
>>> GMB:0=10GB;1=10GB
>>>
>>> If user space then reduces the combined bandwidth to 2GB then the MB limit is wrong since it
>>> is actually capped by the GMB limit:
>>>
>>> MB:0=10GB;1=10GB <==== Does not reflect the possible per-domain memory bandwidth, which is now capped by GMB
>>> GMB:0=2GB;1=2GB
>>>
>>> Would something like below not be more accurate that reflects that the maximum average bandwidth
>>> each domain could achieve is 2GB?
>>>
>>> MB:0=2GB;1=2GB <==== Reflects accurate possible per-domain memory bandwidth
>>> GMB:0=2GB;1=2GB
>>
>> That is reasonable. Will check how we can accommodate that.
>
> Right, this is not about the values in the L3BE registers but instead how those values
> are impacted by GLBE registers and how to most accurately present the resulting system
> configuration to user space. Thank you for considering.
I responded too quickly earlier; an internal discussion surfaced several
concerns with this approach.
The schemata file represents what user space explicitly configured and
what the hardware registers contain, not a derived "effective" value
that depends on runtime conditions.
Combining configured limits (MB/GMB) with effective bandwidth, which is
inherently workload-dependent, blurs semantics, breaks existing
assumptions, and makes debugging more difficult.
MB and GMB use different units and encodings, so auto-deriving values
can introduce rounding issues and a loss of precision.
I'll revisit this and come back with a refined proposal.
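To make the rounding concern concrete, here is a tiny illustrative model. The step sizes and names below are assumptions for discussion only, not values from the spec:

```python
# Illustrative model only: the granularities below are assumptions made
# for this discussion, not values taken from the PQoS spec.
MB_STEP_GBPS = 1 / 8   # hypothetical: MB programmed in 1/8 GBps steps
GMB_STEP_GBPS = 1.0    # hypothetical: GMB programmed in whole GBps

def effective_mb_units(mb_units, gmb_units):
    """Cap the displayed MB value by the GMB ceiling.

    Converting the GMB ceiling into MB units can lose precision when
    the step sizes differ, which is the rounding concern above.
    """
    gmb_in_mb_units = int(gmb_units * GMB_STEP_GBPS / MB_STEP_GBPS)
    return min(mb_units, gmb_in_mb_units)

# MB set to 10 GBps (80 units of 1/8 GBps), GMB ceiling at 2 GBps:
assert effective_mb_units(80, 2) == 16    # display capped at 2 GBps
# GMB ceiling above the MB limit: MB value shown unchanged.
assert effective_mb_units(80, 20) == 80
```

The truncation in the unit conversion is exactly where an auto-derived MB value could silently lose precision.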
>
>>
>>>
>>> (*) As a side-note we may have to start being careful with how we use "limits" because of the planned
>>> introduction of a "MAX" as a bandwidth control that is an actual limit as opposed to the
>>> current control that is approximate.
>>>
>>>>>> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
>>>>> Yes, this will require resctrl to maintain more state.
>>>>>
>>>>> Documenting behavior is an option but I think we should first consider if there are things
>>>>> resctrl can do to make the interface intuitive to use.
>>>>>
>>>>>>>>>> From the description it sounds as though there is a new "memory bandwidth
>>>>>>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>>>>>>> GMBA allocations while the proposed user interface present them as independent.
>>>>>>>>>
>>>>>>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>>>>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>>>>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>>>>>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>>>>>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>>>>>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>>>>>>> I hope this clarifies your question.
>>>>>>> No. When enumerating the features the number of CLOSID supported by each is
>>>>>>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>>>>>>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
>>>>>> No. There is no such scenario.
>>>>>>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>>>>>>> scenarios where some resource groups can support global AND per-domain limits while other
>>>>>>> resource groups can just support global or just support per-domain limits. Is this correct?
>>>>>> The system can support up to 16 CLOSIDs. All of them support all the features: LLC, MB, GMB, SMBA. Yes, we have separate enumeration for each feature. Are you suggesting we change it?
>>>>> It is not a concern to have different CLOSIDs between resources that are actually different,
>>>>> for example, having LLC or MB support different number of CLOSIDs. Having the possibility to
>>>>> allocate the *same* resource (memory bandwidth) with varying number of CLOSIDs does present a
>>>>> challenge though. Would it be possible to have a snippet in the spec that explicitly states
>>>>> that MB and GMB will always enumerate with the same number of CLOSIDs?
>>>>
>>>> I have confirmed that this is always the case. In all current and planned implementations, MB and GMB will have the same number of CLOSIDs.
>>>
>>> Thank you very much for confirming. Is this something the architects would be willing to
>>> commit to with a snippet in the PQoS spec?
>>
>> I checked on that. Here is the response.
>>
>> "I do not plan to add a statement like that to the spec. The CPUID enumeration allows for them to have different number of CLOS's supported for each. However, it is true that for all current and planned implementations, MB and GMB will have the same number of CLOS."
>
> Thank you for asking. At this time the definition of a resource's "num_closids" is:
>
> "num_closids":
> The number of CLOSIDs which are valid for this
> resource. The kernel uses the smallest number of
> CLOSIDs of all enabled resources as limit.
>
> Without commitment from architecture we could expand definition of "num_closids" when
> adding multiple controls to indicate that it is the smallest number of CLOSIDs supported
> by all controls.
Yes. Agree.
Thanks
Babu
>
>>>>> Please see below where I will try to support this request more clearly and you can decide if
>>>>> it is reasonable.
>>>>>
>>>>>>>>> can be seen as a single "resource" that can be allocated differently based on
>>>>>>>>> the various schemata associated with that resource. This currently has a
>>>>>>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>>>>>>> may be something that we can reconsider?
>>>>>>>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
>>>>>>> The new approach is not final so please provide feedback to help improve it so
>>>>>>> that the features you are enabling can be supported well.
>>>>>> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits a certain architecture.
>>>>> It benefits all architectures.
>>>>>
>>>>> There are two parts to the current proposals.
>>>>>
>>>>> Part 1: Generic schema description
>>>>> I believe there is consensus on this approach. This is actually something that is long
>>>>> overdue and something like this would have been a great to have with the initial AMD
>>>>> enabling. With the generic schema description forming part of resctrl the user can learn
>>>>> from resctrl how to interact with the schemata file instead of relying on external information
>>>>> and documentation.
>>>>
>>>> ok.
>>>>
>>>>> For example, on an Intel system that uses percentage based proportional allocation for memory
>>>>> bandwidth the new resctrl files will display:
>>>>> info/MB/resource_schemata/MB/type:scalar linear
>>>>> info/MB/resource_schemata/MB/unit:all
>>>>> info/MB/resource_schemata/MB/scale:1
>>>>> info/MB/resource_schemata/MB/resolution:100
>>>>> info/MB/resource_schemata/MB/tolerance:0
>>>>> info/MB/resource_schemata/MB/max:100
>>>>> info/MB/resource_schemata/MB/min:10
>>>>>
>>>>>
>>>>> On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
>>>>> info/MB/resource_schemata/MB/type:scalar linear
>>>>> info/MB/resource_schemata/MB/unit:GBps
>>>>> info/MB/resource_schemata/MB/scale:1
>>>>> info/MB/resource_schemata/MB/resolution:8
>>>>> info/MB/resource_schemata/MB/tolerance:0
>>>>> info/MB/resource_schemata/MB/max:2048
>>>>> info/MB/resource_schemata/MB/min:1
>>>>>
>>>>> Having such interface will be helpful today. Users do not need to first figure out
>>>>> whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
>>>>> before interacting with resctrl. resctrl will be the generic interface it intends to be.
>>>>
>>>> Yes. That is a good point.
>>>>
>>>>> Part 2: Supporting multiple controls for a single resource
>>>>> This is a new feature on which there also appears to be consensus that is needed by MPAM and
>>>>> Intel RDT where it is possible to use different controls for the same resource. For example,
>>>>> there can be a minimum and maximum control associated with the memory bandwidth resource.
>>>>>
>>>>> For example,
>>>>> info/
>>>>> └─ MB/
>>>>> └─ resource_schemata/
>>>>> ├─ MB/
>>>>> ├─ MB_MIN/
>>>>> ├─ MB_MAX/
>>>>> ┆
>>>>>
>>>>>
>>>>> Here is where the big question comes in for GLBE - is this actually a new resource
>>>>> for which resctrl needs to add interfaces to manage its allocation, or is it instead
>>>>> an additional control associated with the existing memory bandwidth resource?
>>>>
>>>> It is not a new resource. It is a new control mechanism to address a limitation of the memory bandwidth resource.
>>>>
>>>> So, it is a new control for the existing memory bandwidth resource.
>>>
>>> Thank you for confirming.
>>>
>>>>
>>>>> For me things are actually pointing to GLBE not being a new resource but instead being
>>>>> a new control for the existing memory bandwidth resource.
>>>>>
>>>>> I understand that for a PoC it is simplest to add support for GLBE as a new resource as is
>>>>> done in this series, but treating it as an actual unique resource does not seem
>>>>> appropriate since resctrl already has a "memory bandwidth" resource. User space expects
>>>>> to find all the resources that it can allocate in info/ - I do not think it is correct
>>>>> to have two separate directories/resources for memory bandwidth here.
>>>>>
>>>>> What if, instead, it looks something like:
>>>>>
>>>>> info/
>>>>> └── MB/
>>>>> └── resource_schemata/
>>>>> ├── GMB/
>>>>> │ ├──max:4096
>>>>> │ ├──min:1
>>>>> │ ├──resolution:1
>>>>> │ ├──scale:1
>>>>> │ ├──tolerance:0
>>>>> │ ├──type:scalar linear
>>>>> │ └──unit:GBps
>>>>> └── MB/
>>>>> ├──max:8192
>>>>> ├──min:1
>>>>> ├──resolution:8
>>>>> ├──scale:1
>>>>> ├──tolerance:0
>>>>> ├──type:scalar linear
>>>>> └──unit:GBps
>>>>
>>>> Yes. It definitely looks very clean.
>>>>
>>>>> With an interface like above GMB is just another control/schema used to allocate the
>>>>> existing memory bandwidth resource. With the planned files it is possible to express the
>>>>> different maximums and units used by the MB and GMB schema. Users no longer need to
>>>>> dig for the unit information in the docs, it is available in the interface.
>>>>
>>>>
>>>> Yes. That is reasonable.
>>>>
>>>> Is the plan to just update the resource information in /sys/fs/resctrl/info/<resource_name>?
>>>
>>> I do not see any resource information that needs to change. As you confirmed,
>>> MB and GMB have the same number of CLOSIDs and looking at the rest of the
>>> enumeration done in patch #2 all other properties exposed in top level of
>>> /sys/fs/resctrl/info/MB is the same for MB and GMB. Specifically,
>>> thread_throttle_mode, delay_linear, min_bandwidth, and bandwidth_gran have
>>> the same values for MB and GMB. All other content in
>>> /sys/fs/resctrl/info/MB would be new as part of the new "resource_schemata"
>>> sub-directory.
>>>
>>> Even so, I believe we could expect that a user using any new schemata file entry
>>> introduced after the "resource_schemata" directory is introduced is aware of how
>>> the properties are exposed and will not use the top level files in /sys/fs/resctrl/info/MB
>>> (for example min_bandwidth and bandwidth_gran) to understand how to interact with
>>> the new schema.
>>>
>>>
>>>>
>>>> Also, will the display of /sys/fs/resctrl/schemata change?
>>>
>>> There are no plans to change any of the existing schemata file entries.
>>>
>>>>
>>>> Current display:
>>>
>>> When viewing "current" as what this series does in schemata file ...
>>>
>>>>
>>>> GMB:0=4096;1=4096;2=4096;3=4096
>>>> MB:0=8192;1=8192;2=8192;3=8192
>>>
>>> yes, the schemata file should look like this on boot when all is done. All other
>>> user facing changes are to the info/ directory where user space learns about
>>> the new control for the resource and how to interact with the control.
>>>
>>>>> Doing something like this does depend on GLBE supporting the same number of CLOSIDs
>>>>> as MB, which seems to be how this will be implemented. If there is indeed a confirmation
>>>>> of this from AMD architecture then we can do something like this in resctrl.
>>>>
>>>> I don't see this being an issue. I will get consensus on it.
>>>>
>>>> I am wondering about the time frame and who is leading this change. Not sure if that has been discussed already.
>>>> I can definitely help.
>>>
>>> A couple of features depend on the new schema descriptions as well as support for multiple
>>> controls: min/max bandwidth controls on the MPAM side, region aware MBA and MBM on the Intel
>>> side, and GLBE on the AMD side. I am hoping that the folks working on these features can
>>> collaborate on the needed foundation. Since there are no patches for this yet I cannot say
>>> if there is a leader for this work yet, at this time this role appears to be available if you
>>> would like to see this moving forward in order to meet your goals.
>>
>>
>> I joined this feature effort a bit later, so I may not yet have full context on the MPAM and region‑aware requirements. I’m happy to provide all the necessary information for GMB and MB from the AMD side, and I’m also available to help with reviews and testing.
>
> I understand there is a lot involved. With so many folks dependent on this work I anticipate
> that any effort will get support from the various content experts. Your knowledge of resctrl
> fs will be valuable in this effort.
>
> Reinette
>
>
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-13 23:14 ` Moger, Babu
@ 2026-02-14 0:01 ` Reinette Chatre
2026-02-16 16:05 ` Babu Moger
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-14 0:01 UTC (permalink / raw)
To: Moger, Babu, Babu Moger, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Babu,
On 2/13/26 3:14 PM, Moger, Babu wrote:
> Hi Reinette,
>
>
> On 2/13/2026 10:17 AM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 2/12/26 5:51 PM, Moger, Babu wrote:
>>> On 2/12/2026 6:05 PM, Reinette Chatre wrote:
>>>> On 2/12/26 11:09 AM, Babu Moger wrote:
>>>>> On 2/11/26 21:51, Reinette Chatre wrote:
>>>>>> On 2/11/26 1:18 PM, Babu Moger wrote:
>>>>>>> On 2/11/26 10:54, Reinette Chatre wrote:
>>>>>>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>>>>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>>>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>>
>>>> ...
>>>>
>>>>>>>> Another question, when setting aside possible differences between MB and GMB.
>>>>>>>>
>>>>>>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>>>>>>
>>>>>>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>>>>>>> same:
>>>>>>>>
>>>>>>>> # cat schemata
>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>
>>>>>>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>>>>>>> MB limit:
>>>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>>>> # cat schemata
>>>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>> Yes. That is correct. It will cap the MB setting to 8. Note that we are setting aside unit differences to keep it simple.
>>>>>> Thank you for confirming.
>>>>>>
>>>>>>>> ... and then when user space resets GMB the MB can reset like ...
>>>>>>>>
>>>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>>>> # cat schemata
>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>
>>>>>>>> if I understand correctly this will only apply if the MB limit was never set so
>>>>>>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>>>>>>
>>>>>>>> # cat schemata
>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>>
>>>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>>>> # cat schemata
>>>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>>
>>>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>>>> # cat schemata
>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>>
>>>>>>>> What would be most intuitive way for user to interact with the interfaces?
>>>>>>> I see that you are trying to display the effective behaviors above.
>>>>>> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
>>>>>> what a reasonable expectation of resctrl would be during these interactions.
>>>>>>
>>>>>>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>>>>>> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
>>>>>> settings may cause confusion?
>>>>>
>>>>> I mean in many cases, we cannot determine the effective settings correctly. It depends on benchmarks or applications running on the system.
>>>>>
>>>>> Even with MB (without GMB support), even though we set the limit to 10GB, it may not use the whole 10GB. Memory is a shared resource, so the effective bandwidth usage depends on other applications running on the system.
>>>>
>>>> Sounds like we interpret "effective limits" differently. To me the limits(*) are deterministic.
>>>> If I understand correctly, if the GMB limit for domains A and B is set to x GB then that places
>>>> an x GB limit on MB for domains A and B also. Displaying any MB limit in the schemata that is
>>>> larger than x GB for domain A or domain B would be inaccurate, no?
>>>
>>> Yea. But, I was thinking not to mess with values written at registers.
>>
>> This is not about what is written to the registers but how the combined values
>> written to registers control system behavior and how to accurately reflect the
>> resulting system behavior to user space.
>>
>>>> When considering your example where the MB limit is 10GB.
>>>>
>>>> Consider an example where there are two domains in this example with a configuration like below.
>>>> (I am using a different syntax from schemata file that will hopefully make it easier to exchange
>>>> ideas when not having to interpret the different GMB and MB units):
>>>>
>>>> MB:0=10GB;1=10GB
>>>>
>>>> If user space can create a GMB domain that limits shared bandwidth to 10GB that can be displayed
>>>> as below and will be accurate:
>>>>
>>>> MB:0=10GB;1=10GB
>>>> GMB:0=10GB;1=10GB
>>>>
>>>> If user space then reduces the combined bandwidth to 2GB then the MB limit is wrong since it
>>>> is actually capped by the GMB limit:
>>>>
>>>> MB:0=10GB;1=10GB <==== Does not reflect the possible per-domain memory bandwidth, which is now capped by GMB
>>>> GMB:0=2GB;1=2GB
>>>>
>>>> Would something like below not be more accurate that reflects that the maximum average bandwidth
>>>> each domain could achieve is 2GB?
>>>>
>>>> MB:0=2GB;1=2GB <==== Reflects accurate possible per-domain memory bandwidth
>>>> GMB:0=2GB;1=2GB
>>>
>>> That is reasonable. Will check how we can accommodate that.
>>
>> Right, this is not about the values in the L3BE registers but instead how those values
>> are impacted by GLBE registers and how to most accurately present the resulting system
>> configuration to user space. Thank you for considering.
>
>
> I responded too quickly earlier; an internal discussion surfaced several concerns with this approach.
>
> The schemata file represents what user space explicitly configured and what the hardware registers contain, not a derived "effective" value that depends on runtime conditions.
> Combining configured limits (MB/GMB) with effective bandwidth, which is inherently workload-dependent, blurs semantics, breaks existing assumptions, and makes debugging more difficult.
>
> MB and GMB use different units and encodings, so auto-deriving values can introduce rounding issues and a loss of precision.
>
> I'll revisit this and come back with a refined proposal.
Are we still talking about the text below, copied from https://lore.kernel.org/lkml/f0f2e3eb-0fdb-4498-9eb8-73111b1c5a84@amd.com/ ?
The MBA ceiling is applied at the QoS domain level.
The GLBE ceiling is applied at the GLBE control domain level.
If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
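Put as a toy model (a single hypothetical unit for all values; purely illustrative, not the register encoding):

```python
# Toy model of the statements above: MBA ceilings apply per QoS domain,
# the GLBE ceiling applies to the whole GLBE control domain, and the
# effective per-domain MBA limit is whichever ceiling is tighter.
def effective_mba(mba_per_domain, glbe_ceiling):
    return {d: min(m, glbe_ceiling) for d, m in mba_per_domain.items()}

def aggregate_cap(mba_per_domain, glbe_ceiling):
    # The GLBE ceiling is shared, so it also bounds the combined total
    # of all domains in the control domain.
    return min(sum(mba_per_domain.values()), glbe_ceiling)

mba = {"A": 10, "B": 10}
assert effective_mba(mba, 2) == {"A": 2, "B": 2}     # capped by GLBE
assert aggregate_cap(mba, 2) == 2                    # shared ceiling
assert effective_mba(mba, 20) == {"A": 10, "B": 10}  # GLBE not limiting
```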
Reinette
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-13 16:37 ` Moger, Babu
2026-02-13 17:02 ` Luck, Tony
@ 2026-02-14 0:10 ` Reinette Chatre
2026-02-16 15:41 ` Ben Horgan
2026-02-16 22:36 ` Moger, Babu
1 sibling, 2 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-14 0:10 UTC (permalink / raw)
To: Moger, Babu, Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Babu,
On 2/13/26 8:37 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>
>>>
>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>> Babu,
>>>>>
>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>
>>>>> Some useful additions to your explanation.
>>>>>
>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>
>>>> Yes. Correct.
>>
>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>
> There can be only one PLZA configuration in a system. The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID, CLOSID_EN) must be identical across all logical processors. The only field that may differ is PLZA_EN.
ah - this is a significant part that I missed. Since this is a per-CPU register it seems
to have the ability for expanded use in the future where different CLOSID and RMID may be
written to it? Is PLZA leaving room for such future enhancement or does the spec contain
the text that states "The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN,
CLOSID, CLOSID_EN) must be identical across all logical processors."? That is, "forever
and always"?
If I understand correctly MPAM could have different PARTID and PMG for kernel use so we
need to consider these different architectural behaviors.
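To check my understanding, here is a small sketch of that cross-CPU consistency rule; the field layout and types are invented for illustration and are not the architectural encoding:

```python
# Illustrative model of the constraint stated above: every logical CPU
# has its own MSR_IA32_PQR_PLZA_ASSOC, but RMID, RMID_EN, CLOSID and
# CLOSID_EN must be identical across CPUs; only PLZA_EN may differ.
from dataclasses import dataclass

@dataclass
class PlzaAssoc:
    rmid: int
    rmid_en: bool
    closid: int
    closid_en: bool
    plza_en: bool

def plza_config_valid(per_cpu):
    # Collect the shared fields; at most one distinct combination is
    # allowed. PLZA_EN is deliberately excluded from the comparison.
    shared = {(m.rmid, m.rmid_en, m.closid, m.closid_en) for m in per_cpu}
    return len(shared) <= 1

cpus = [PlzaAssoc(3, True, 5, True, plza_en=(c == 0)) for c in range(4)]
assert plza_config_valid(cpus)       # only PLZA_EN differs: valid
cpus[2].closid = 7                   # a diverging CLOSID violates the rule
assert not plza_config_valid(cpus)
```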
> I was initially unsure which RMID should be used when PLZA is enabled on MON groups.
>
> After re-evaluating, enabling PLZA on MON groups is still feasible:
>
> 1. Only one group in the system can have PLZA enabled.
> 2. If PLZA is enabled on a CTRL_MON group then we cannot enable PLZA on a MON group.
> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and RMID of the CTRL_MON group can be written.
> 4. If PLZA is enabled on a MON group, then the CLOSID of the CTRL_MON group can be used, while the RMID of the MON group can be written.
>
> I am thinking this approach should work.
>
>>
>> Limiting it to a single CTRL group seems restrictive in a few ways:
>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>> number of use cases that can be supported. Consider, for example, an existing
>> "high priority" resource group and a "low priority" resource group. The user may
>> just want to let the tasks in the "low priority" resource group run as "high priority"
>> when in CPL0. This of course may depend on what resources are allocated, for example
>> cache may need more care, but if, for example, user is only interested in memory
>> bandwidth allocation this seems a reasonable use case?
>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>> capable of in terms of number of different control groups/CLOSID that can be
>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>> example, create a resource group that contains tasks of interest and create
>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>> This will give user space better insight into system behavior and from what I can
>> tell is supported by the feature but not enabled?
>
>
> Yes, as long as PLZA is enabled on only one group in the entire system.
>
>>
>>>>
>>>>> 2) It can't be the root/default group
>>>>
>>>> This is something I added to keep the default group in an undisturbed,
>>
>> Why was this needed?
>>
>
> With the new approach mentioned above, we can enable it in the default group also.
>
>>>>
>>>>> 3) It can't have sub monitor groups
>>
>> Why not?
>
> Ditto. With the new approach mentioned above, we can enable it in the default group also.
>
>>
>>>>> 4) It can't be pseudo-locked
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>> need to change.
>>>>
>>>> Yes. That can be one use case.
>>>>
>>>>>
>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>> do:
>>>>>
>>>>> # echo '*' > tasks
>>
>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>> complications since this designation makes resource group behave differently and
>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>>
>> I am wondering if it will not be simpler to introduce just one new file, for example
>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>> resource group to manage user space and kernel space allocations while also supporting
>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>> use case where user space can create a new resource group with certain allocations but the
>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>> the resource group's allocations when in CPL0.
>
> Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
>
> We need to make sure only one group can be configured in the system and not allow it in other groups when it is already enabled.
As I understand it, this means that only one group can have content in its
tasks_cpl0/tasks_kernel file. There should not be any special handling for
the remaining files of the resource group since the resource group is not
dedicated to kernel work and can be used as a user space resource group also.
If user space wants to create a dedicated kernel resource group there can be
a new resource group with an empty tasks file.
hmmm ... but if user space writes a task ID to a tasks_cpl0/tasks_kernel file then
resctrl would need to create new syntax to remove that task ID.
Possibly MPAM can build on this by allowing user space to write to multiple
tasks_cpl0/tasks_kernel files? (and the next version of PLZA may too)
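A rough sketch of the single-owner rule such a file would need; the names and error behavior here are proposals from this discussion, not a settled interface:

```python
# Rough sketch of the proposed rule: only one resource group may have
# a non-empty tasks_cpl0 file at a time. File name and semantics are
# still under discussion and are assumptions here.
class ResctrlGroups:
    def __init__(self):
        self.tasks_cpl0 = {}   # group name -> set of task IDs

    def write_tasks_cpl0(self, group, task_id):
        # Find the group, if any, that already owns the PLZA association.
        owner = next((g for g, t in self.tasks_cpl0.items() if t), None)
        if owner is not None and owner != group:
            raise PermissionError("PLZA already enabled on another group")
        self.tasks_cpl0.setdefault(group, set()).add(task_id)

rg = ResctrlGroups()
rg.write_tasks_cpl0("grp1", 1234)
rg.write_tasks_cpl0("grp1", 1235)     # same group: accepted
try:
    rg.write_tasks_cpl0("grp2", 42)   # a second group: rejected
    raised = False
except PermissionError:
    raised = True
assert raised
```

An MPAM-style extension could relax the owner check to permit multiple groups, which is why keeping the check in one place matters.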
Reinette
>
> Thanks
> Babu
>
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>
>
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-12 18:37 ` Reinette Chatre
@ 2026-02-16 15:18 ` Ben Horgan
2026-02-17 18:51 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-16 15:18 UTC (permalink / raw)
To: Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Reinette,
On Thu, Feb 12, 2026 at 10:37:21AM -0800, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/12/26 5:55 AM, Ben Horgan wrote:
> > On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
> >> On 2/11/26 8:40 AM, Ben Horgan wrote:
> >>> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
> >>>> On 2/10/26 8:17 AM, Reinette Chatre wrote:
> >>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
> >>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> >>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
> >>>>>>>> Babu,
> >>>>>>>>
> >>>>>>>> I've read a bit more of the code now and I think I understand more.
> >>>>>>>>
> >>>>>>>> Some useful additions to your explanation.
> >>>>>>>>
> >>>>>>>> 1) Only one CTRL group can be marked as PLZA
> >>>>>>>
> >>>>>>> Yes. Correct.
> >>>>>
> >>>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
> >>>>>
> >>>>> Limiting it to a single CTRL group seems restrictive in a few ways:
> >>>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> >>>>> number of use cases that can be supported. Consider, for example, an existing
> >>>>> "high priority" resource group and a "low priority" resource group. The user may
> >>>>> just want to let the tasks in the "low priority" resource group run as "high priority"
> >>>>> when in CPL0. This of course may depend on what resources are allocated, for example
> >>>>> cache may need more care, but if, for example, user is only interested in memory
> >>>>> bandwidth allocation this seems a reasonable use case?
> >>>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> >>>>> capable of in terms of number of different control groups/CLOSID that can be
> >>>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> >>>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> >>>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> >>>>> example, create a resource group that contains tasks of interest and create
> >>>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> >>>>> This will give user space better insight into system behavior and from what I can
> >>>>> tell is supported by the feature but not enabled?
> >>>>>
> >>>>>>>
> >>>>>>>> 2) It can't be the root/default group
> >>>>>>>
> >>>>>>> This is something I added to keep the default group in a un-disturbed,
> >>>>>
> >>>>> Why was this needed?
> >>>>>
> >>>>>>>
> >>>>>>>> 3) It can't have sub monitor groups
> >>>>>
> >>>>> Why not?
> >>>>>
> >>>>>>>> 4) It can't be pseudo-locked
> >>>>>>>
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
> >>>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
> >>>>>>>> need to change.
> >>>>>>>
> >>>>>>> Yes. That can be one use case.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> If that is the case, maybe for the PLZA group we should allow user to
> >>>>>>>> do:
> >>>>>>>>
> >>>>>>>> # echo '*' > tasks
> >>>>>
> >>>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
> >>>>> complications since this designation makes resource group behave differently and
> >>>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
> >
> > As I commented on another thread, I'm wary of this reuse of existing file types
> > as they can confuse existing user-space tools.
>
> I agree. Changing how user space interacts with existing files is a change that would
> require a mount option and this can be avoided by using new files instead.
>
> >>>>> I am wondering if it will not be simpler to introduce just one new file, for example
> >>>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> >>>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> >>>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> >>>>> resource group to manage user space and kernel space allocations while also supporting
> >>>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> >>>>> use case where user space can create a new resource group with certain allocations but the
> >>>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> >>>>> the resource group's allocations when in CPL0.
> >>>
> >>> If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
> >>
> >> That is reasonable, yes.
> >
> > I think the "tasks_cpl0" approach suffers from one of the same faults as the
> > "kernel_groups" approach. If you want to run a task with userspace configuration
> > closid-A rmid-Y but to run in kernel space in closid-B but the same rmid-Y then
> > there can't exist monitor_group in resctrl for both.
>
> This assumes that "tasks" and "tasks_cpl0"/"tasks_kernel" have the same rules for
> task assignment. When a user assigns a task to the "tasks" file of a MON group it
> is required that the task is a member of the parent CTRL_MON group and if so, that
> task's CLOSID and RMID are both updated. Theoretically there could be different rules
> for task assignment to the "tasks_cpl0"/"tasks_kernel" file that do not place such a
> restriction and only update CLOSID when moving to a CTRL_MON group and only update
> RMID when moving to a MON group.
>
> You are correct that resctrl cannot have monitor groups to track such configuration
> and there may indeed be some consequences that I have not considered.
>
> I understand this is not something that MPAM can support and I also do not know if this
> is even a valid use case. If doing something like this user space will need to take care
> since the monitoring data will be presented with the allocations used when tasks are in
> user space but also contain the monitoring data for allocations used when tasks are in
> kernel space that are tracked in another control group hierarchy (to which I expect the
> task's kernel space monitoring can move when the MON group is deleted).
>
>
> >>>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
> >>>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
> >>>> instead of CPL0 using something like "kernel" or ... ?
> >>>
> >>> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
> >>> internally and here are a few thoughts.
> >>>
> >>> If the use case is just an option to run all tasks with the same closid/rmid
> >>> (partid/pmg) configuration when they are running in the kernel then I'd favour a
> >>> mount option. The resctrl filesystem interface doesn't need to change and
> >>
> >> I view mount options as an interface of last resort. Why would a mount option be needed
> >> in this case? The existence of the file used to configure the feature seems sufficient?
> >
> > If we are taking away a closid from the user then the number of CTRL_MON groups
> > that can be created changes. It seems reasonable for user-space to expect
> > num_closid to be a fixed value.
>
> I do not see why we need to take away a CLOSID from the user. Consider a user space that
Yes, it is just slightly simpler to take away a CLOSID, but we could just go with
the default CLOSID also being used for the kernel. I would be ok with a file saying
the mode, like the mbm_event file does for counter assignment. It is slightly misleading
that a configuration file is under info but necessary as we don't have another
location global to the resctrl mount.
> runs with just two resource groups, for example, "high priority" and "low priority", it seems
> reasonable to make it possible to let the "low priority" tasks run with "high priority"
> allocations when in kernel space without needing to dedicate a new CLOSID? More reasonable
> when only considering memory bandwidth allocation though.
>
> >
> >>
> >> Also ...
> >>
> >> I do not think resctrl should unnecessarily place constraints on what the hardware
> >> features are capable of. As I understand, both PLZA and MPAM supports use case where
> >> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
> >> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
> >> This may be because I am not familiar with all the requirements here so please do
> >> help with insight on how the hardware feature is intended to be used as it relates
> >> to its design.
> >>
> >> We have to be very careful when constraining a feature this much. If resctrl does something
> >> like this it essentially restricts what users could do forever.
> >
> > Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
> > fixed kernel CLOSID/RMID configuration option might just give all we need for
> > usecases we know we have and be minimally intrusive enough to not preclude a
> > more featureful PLZA later when new usecases come about.
>
> Having ability to grow features would be ideal. I do not see how a fixed kernel CLOSID/RMID
> configuration leaves room to build on top though. Could you please elaborate?
If we initially go with a single new configuration file, e.g. kernel_mode, which
could be "match_user" or "use_root", this would be the only initial change to the
interface needed. If more usecases present themselves a new mode could be added,
e.g. "configurable", and an interface to actually change the rmid/closid for the
kernel could be added.
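As a concrete sketch of this idea: the mode file name ("kernel_mode"), its
location under info/, and the mode strings ("match_user", "use_root", and a
possible future "configurable") are all hypothetical, taken from the suggestion
above; resctrl has no such file today. A temp directory stands in for the real
mount.

```shell
# Illustrative only: model the suggested global "kernel_mode" file with
# a temp dir standing in for /sys/fs/resctrl/info. The mode strings are
# the ones proposed in this thread, not an existing kernel interface.
INFO=$(mktemp -d)
echo match_user > "$INFO/kernel_mode"   # kernel uses the task's own CLOSID/RMID

# Flip all kernel-mode execution to the default group's CLOSID/RMID.
echo use_root > "$INFO/kernel_mode"
cat "$INFO/kernel_mode"
```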
>
> I wonder if the benefit of the fixed CLOSID/RMID is perhaps mostly in the cost of
> context switching which I do not think is a concern for MPAM but it may be for PLZA?
>
> One option to support fixed kernel CLOSID/RMID at the beginning and leave room to build
> may be to create the kernel_group or "tasks_kernel" interface as a baseline but in first
> implementation only allow user space to write the same group to all "kernel_group" files or
> to only allow to write to one of the "tasks_kernel" files in the resctrl fs hierarchy. At
> that time the associated CLOSID/RMID would become the "fixed configuration" and attempts to
> write to others can return "ENOSPC"?
I think we'd have to be sure of the final interface if we go this way.
>
> From what I can tell this still does not require to take away a CLOSID/RMID from user space
> though. Dedicating a CLOSID/RMID to kernel work can still be done but be in control of user
> that can, for example leave the "tasks" and "cpus" files empty.
>
> > One complication with the fixed kernel CLOSID/RMID option is that for x86 you
> > may want to be able to monitor a task's resource usage whether or not it is in
> > the kernel or userspace and so only have a fixed CLOSID. However, for MPAM this
> > wouldn't work as PMG (~RMID) is scoped to PARTID (~CLOSID).
> >
> >>
> >>> userspace software doesn't need to change. This could either take away a
> >>> closid/rmid from userspace and dedicate it to the kernel or perhaps have a
> >>> policy to have the default group as the kernel group. If you use the default
> >>
> >> Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
> >> between user space and kernel. I do not see a motivation for resctrl to place such
> >> constraint.
> >>
> >>> configuration, at least for MPAM, the kernel may not be running at the highest
> >>> priority as a minimum bandwidth can be used to give a priority boost. (Once we
> >>> have a resctrl schema for this.)
> >>>
> >>> It could be useful to have something a bit more featureful though. Is there a
> >>> need for the two mappings, task->cpl0 config and task->cpl1 config, to be independent or
> >>> would a task->(cpl0 config, cpl1 config) be sufficient? It seems awkward that
> >>> it's not a single write to move a task. If a single mapping is sufficient, then
> >>
> >> Moving a task in x86 is currently two writes by writing the CLOSID and RMID separately.
> >> I think the MPAM approach is better and there may be opportunity to do this in a similar
> >> way and both architectures use the same field(s) in the task_struct.
> >
> > I was referring to the userspace file write but unifying on the same fields in
> > task_struct could be good. The single write is necessary for MPAM as PMG is
> > scoped to PARTID and I don't think x86 behaviour changes if it moves to the same
> > approach.
> >
>
> ah - I misunderstood. You are suggesting to have one file that user writes to
> to set both user space and kernel space CLOSID/RMID? This sounds like what the
Yes, the kernel_groups idea does partially have this as once you've set the
kernel_group for a CTRL_MON or MON group then the user space configuration
dictates the kernel space configuration. As you pointed out, this is also
a drawback of the kernel_groups idea.
> existing "tasks" file does but only supports the same CLOSID/RMID for both user
> space and kernel space. To support the new hardware features where the CLOSID/RMID
> can be different we cannot just change "tasks" interface and would need to keep it
> backward compatible. So far I assumed that it would be ok for the "tasks" file
> to essentially get new meaning as the CLOSID/RMID for just user space work, which
> seems to require a second file for kernel space as a consequence? So far I have
> not seen an option that does not change meaning of the "tasks" file.
Would it make sense to have some new types of entries in the tasks file,
e.g. k_ctrl_<pid> and k_mon_<pid>, to say: in the kernel, use the closid of this
CTRL_MON group for this task pid, or use the rmid of this CTRL_MON/MON group for
this task pid? We would still probably need separate files for the cpu configuration.
If separate files make more sense, then we might need 2 extra tasks files to
decouple closid and rmid, e.g. tasks_k_ctrl and tasks_k_mon. The tasks_k_mon would
be in all CTRL_MON and MON groups and determine the rmid, and tasks_k_ctrl just
in a CTRL_MON group and determine the closid.
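A sketch of the two-file split described above. The file names (spelled
tasks_k_ctrl and tasks_k_mon here) come from this proposal and do not exist in
resctrl today; the group names and task ID are invented, and a temp directory
stands in for /sys/fs/resctrl.

```shell
# Illustrative model: tasks_k_ctrl supplies the kernel-mode closid
# (CTRL_MON groups only) and tasks_k_mon supplies the kernel-mode rmid
# (CTRL_MON and MON groups). Nothing here touches real resctrl.
R=$(mktemp -d)
mkdir -p "$R/ctrl_a/mon_groups/mon1" "$R/ctrl_b"

# Task 42: while in the kernel, use ctrl_b's closid but mon1's rmid,
# decoupling kernel allocation from kernel monitoring.
echo 42 > "$R/ctrl_b/tasks_k_ctrl"
echo 42 > "$R/ctrl_a/mon_groups/mon1/tasks_k_mon"
```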
>
> >>> a single new file, kernel_group, per CTRL_MON group (maybe MON groups) as
> >>> suggested above but rather than a task that file could hold a path to the
> >>> CTRL_MON/MON group that provides the kernel configuration for tasks running in
> >>> that group. So that this can be transparent to existing software an empty string
> >>
> >> Something like this would force all tasks of a group to run with the same CLOSID/RMID
> >> (PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
> >> and may reduce the possible use case of this feature.
> >>
> >> For example,
> >> - There may be a scenario where there is a set of tasks with a particular allocation
> >> when running in user space but when in kernel these tasks benefit from different
> >> allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
> >> user space with allocations from resource_groupA. While these tasks are ok with this
> >> allocation when in user space they have different requirements when it comes to
> >> kernel space. There may be a resource_groupB that allocates a lot of resources ("high
> >> priority") that task 1 should use for kernel work and a resource_groupC that allocates
> >> fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
> >>
> >> resource_groupA:
> >> schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
> >> tasks when in user space: 1, 2, 3
> >>
> >> resource_groupB:
> >> schemata: <high priority allocations>
> >> tasks when in kernel space: 1
> >>
> >> resource_groupC:
> >> schemata: <medium priority allocations>
> >> tasks when in kernel space: 2, 3
> >
> > I'm not sure if this would happen in the real world or not.
>
> Ack. I would like to echo Tony's request for feedback from resctrl users
> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
Indeed. This is all getting a bit complicated.
Thanks,
Ben
>
> >
> >>
> >> If user space is forced to have the same tasks have the same user space and kernel
> >> allocations then that will force user space to create additional resource groups that
> >> will use up CLOSID/PARTID that is a scarce resource.
> >
> > This may be undesirable even if CLOSID/PARTID were unlimited as controls which set
> > a per-CLOSID/PARTID maximum don't have the same effect if the tasks are spread across
> > more than one CLOSID/PARTID.
>
> Thank you for bringing this up. I did not consider the mechanics of the memory bandwidth
> controls.
>
> >
> >>
> >> - There may be a scenario where the user is attempting to understand system behavior by
> >> monitoring individual or subsets of tasks' bandwidth usage when in kernel space.
> >
> > This seems useful to me.
> >
> >>
> >> - From what I can tell PLZA also supports *different* allocations when in user vs
> >> kernel space while using the *same* monitoring group for both. This does not seem
> >> transferable to MPAM and would take more effort to support in resctrl but it is
> >> a use case that the hardware enables.
> >
> > Ah yes, I think this ends the 'kernel_group' idea then. I was too focused on
> > MPAM and had forgotten to consider the case where PMG and PARTID are independent.
>
> Of course we would want user space to have consistent experience from resctrl no matter the
> architecture so these places where architectures behave differently need more care.
>
> >> When enabling a feature I would of course prefer not to add unnecessary complexity. Even so,
> >> resctrl is expected to expose hardware capabilities to user space. There seem to be some
> >> opinions on how user space will now and forever interact with these features that
> >> are not clear to me so I would appreciate more insight in why these constraints are
> >> appropriate.
> >
> > Yes, care definitely needs to be taken here in order to not back ourselves into
> > a corner.
>
> I really appreciate the discussions to help create a useful interface.
>
> Reinette
>
> >
> >>
> >> Reinette
> >>
> >>> can mean use the current group's when in the kernel (as well as for
> >>> userspace). A slash, /, could be used to refer to the default group. This would
> >>> give something like the below under /sys/fs/resctrl.
> >>>
> >>> .
> >>> ├── cpus
> >>> ├── tasks
> >>> ├── ctrl1
> >>> │ ├── cpus
> >>> │ ├── kernel_group -> mon_groups/mon1
> >>> │ └── tasks
> >>> ├── kernel_group -> ctrl1
> >>> └── mon_groups
> >>> └── mon1
> >>> ├── cpus
> >>> ├── kernel_group -> ctrl1
> >>> └── tasks
> >>>
> >>>>
> >>>> I have not read anything about the RISC-V side of this yet.
> >>>>
> >>>> Reinette
> >>>>
> >>>>>
> >>>>> Reinette
> >>>>>
> >>>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
> >>>>
> >>>
> >>> Thanks,
> >>>
> >>> Ben
> >>
> >
> > Thanks,
> >
> > Ben
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-14 0:10 ` Reinette Chatre
@ 2026-02-16 15:41 ` Ben Horgan
2026-02-16 22:52 ` Moger, Babu
2026-02-16 22:36 ` Moger, Babu
1 sibling, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-16 15:41 UTC (permalink / raw)
To: Reinette Chatre, Moger, Babu, Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Babu, Reinette,
On 2/14/26 00:10, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/13/26 8:37 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>
>>>>
>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>> Babu,
>>>>>>
>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>
>>>>>> Some useful additions to your explanation.
>>>>>>
>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>
>>>>> Yes. Correct.
>>>
>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>>
>> There can be only one PLZA configuration in a system. The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID, CLOSID_EN) must be identical across all logical processors. The only field that may differ is PLZA_EN.
Does this have any effect on hypervisors?
>
> ah - this is a significant part that I missed. Since this is a per-CPU register it seems
I also missed that.
> to have the ability for expanded use in the future where different CLOSID and RMID may be
> written to it? Is PLZA leaving room for such future enhancement or does the spec contain
> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN,
> CLOSID, CLOSID_EN) must be identical across all logical processors."? That is, "forever
> and always"?
>
> If I understand correctly MPAM could have different PARTID and PMG for kernel use so we
> need to consider these different architectural behaviors.
Yes, MPAM has a per-cpu register MPAM1_EL1.
>
>> I was initially unsure which RMID should be used when PLZA is enabled on MON groups.
>>
>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>
>> 1. Only one group in the system can have PLZA enabled.
>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA on MON group.
>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and RMID of the CTRL_MON group can be written.
>> 4. If PLZA is enabled on a MON group, then the CLOSID of the CTRL_MON group can be used, while the RMID of the MON group can be written.
Given that CLOSID and RMID are fixed once in the PLZA configuration
could this be simplified by just assuming they have the values of the
default group, CLOSID=0 and RMID=0, and let the user base their
configuration on that?
>>
>> I am thinking this approach should work.
>>
>>>
>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>> number of use cases that can be supported. Consider, for example, an existing
>>> "high priority" resource group and a "low priority" resource group. The user may
>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>> cache may need more care, but if, for example, user is only interested in memory
>>> bandwidth allocation this seems a reasonable use case?
>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>> capable of in terms of number of different control groups/CLOSID that can be
>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>> example, create a resource group that contains tasks of interest and create
>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>> This will give user space better insight into system behavior and from what I can
>>> tell is supported by the feature but not enabled?
>>
>>
>> Yes, as long as PLZA is enabled on only one group in the entire system.
>>
>>>
>>>>>
>>>>>> 2) It can't be the root/default group
>>>>>
>>>>> This is something I added to keep the default group in a un-disturbed,
>>>
>>> Why was this needed?
>>>
>>
>> With the new approach mentioned above we can enable it in the default group also.
>>
>>>>>
>>>>>> 3) It can't have sub monitor groups
>>>
>>> Why not?
>>
>> Ditto. With the new approach mentioned above we can enable this also.
>>
>>>
>>>>>> 4) It can't be pseudo-locked
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>> need to change.
>>>>>
>>>>> Yes. That can be one use case.
>>>>>
>>>>>>
>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>> do:
>>>>>>
>>>>>> # echo '*' > tasks
>>>
>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>> complications since this designation makes resource group behave differently and
>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>>>
>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>> resource group to manage user space and kernel space allocations while also supporting
>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>> use case where user space can create a new resource group with certain allocations but the
>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>> the resource group's allocations when in CPL0.
>>
>> Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
>>
>> We need to make sure only one group can be configured in the system and not allow it in other groups when it is already enabled.
>
> As I understand this means that only one group can have content in its
> tasks_cpl0/tasks_kernel file. There should not be any special handling for
> the remaining files of the resource group since the resource group is not
> dedicated to kernel work and can be used as a user space resource group also.
> If user space wants to create a dedicated kernel resource group there can be
> a new resource group with an empty tasks file.
>
> hmmm ... but if user space writes a task ID to a tasks_cpl0/tasks_kernel file then
> resctrl would need to create new syntax to remove that task ID.
>
> Possibly MPAM can build on this by allowing user space to write to multiple
> tasks_cpl0/tasks_kernel files? (and the next version of PLZA may too)
>
> Reinette
>
>
>>
>> Thanks
>> Babu
>>
>>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>>
>>
>
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-14 0:01 ` Reinette Chatre
@ 2026-02-16 16:05 ` Babu Moger
0 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-02-16 16:05 UTC (permalink / raw)
To: Reinette Chatre, Moger, Babu, corbet, tony.luck, Dave.Martin,
james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/13/26 18:01, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/13/26 3:14 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>>
>> On 2/13/2026 10:17 AM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 2/12/26 5:51 PM, Moger, Babu wrote:
>>>> On 2/12/2026 6:05 PM, Reinette Chatre wrote:
>>>>> On 2/12/26 11:09 AM, Babu Moger wrote:
>>>>>> On 2/11/26 21:51, Reinette Chatre wrote:
>>>>>>> On 2/11/26 1:18 PM, Babu Moger wrote:
>>>>>>>> On 2/11/26 10:54, Reinette Chatre wrote:
>>>>>>>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>>>>>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>>>>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>>> ...
>>>>>
>>>>>>>>> Another question, when setting aside possible differences between MB and GMB.
>>>>>>>>>
>>>>>>>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>>>>>>>
>>>>>>>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>>>>>>>> same:
>>>>>>>>>
>>>>>>>>> # cat schemata
>>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>>
>>>>>>>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>>>>>>>> MB limit:
> >>>>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>>>>> # cat schemata
>>>>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>> Yes. That is correct. It will cap the MB setting to 8. Note that we are talking about unit differences to make it simple.
>>>>>>> Thank you for confirming.
>>>>>>>
>>>>>>>>> ... and then when user space resets GMB the MB can reset like ...
>>>>>>>>>
> >>>>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>>>>> # cat schemata
>>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>>
>>>>>>>>> if I understand correctly this will only apply if the MB limit was never set so
>>>>>>>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>>>>>>>
>>>>>>>>> # cat schemata
>>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>>>
> >>>>>>>>> # echo "GMB:0=8;2=8" > schemata
>>>>>>>>> # cat schemata
>>>>>>>>> GMB:0=8;1=2048;2=8;3=2048
>>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>>>
> >>>>>>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>>>>>>> # cat schemata
>>>>>>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>>>>>>> MB:0=8;1=2048;2=8;3=2048
>>>>>>>>>
>>>>>>>>> What would be most intuitive way for user to interact with the interfaces?
>>>>>>>> I see that you are trying to display the effective behaviors above.
>>>>>>> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
>>>>>>> what would be a reasonable expectation from resctrl be during these interactions.
>>>>>>>
>>>>>>>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>>>>>>> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
>>>>>>> settings may cause confusion?
>>>>>> I mean in many cases, we cannot determine the effective settings correctly. It depends on benchmarks or applications running on the system.
>>>>>>
>>>>>> Even with MB alone (without GMB support), even though we set the limit to 10GB, the group may not use the whole 10GB. Memory is a shared resource, so the effective bandwidth usage depends on the other applications running on the system.
>>>>> Sounds like we interpret "effective limits" differently. To me the limits(*) are deterministic.
>>>>> If I understand correctly, if the GMB limit for domains A and B is set to x GB then that places
>>>>> an x GB limit on MB for domains A and B also. Displaying any MB limit in the schemata that is
>>>>> larger than x GB for domain A or domain B would be inaccurate, no?
>>>> Yes. But I was thinking we should not mess with the values written to the registers.
>>> This is not about what is written to the registers but how the combined values
>>> written to registers control system behavior and how to accurately reflect the
>>> resulting system behavior to user space.
>>>
>>>>> When considering your example where the MB limit is 10GB.
>>>>>
>>>>> Consider an example where there are two domains in this example with a configuration like below.
>>>>> (I am using a different syntax from schemata file that will hopefully make it easier to exchange
>>>>> ideas when not having to interpret the different GMB and MB units):
>>>>>
>>>>> MB:0=10GB;1=10GB
>>>>>
>>>>> If user space can create a GMB domain that limits shared bandwidth to 10GB that can be displayed
>>>>> as below and will be accurate:
>>>>>
>>>>> MB:0=10GB;1=10GB
>>>>> GMB:0=10GB;1=10GB
>>>>>
>>>>> If user space then reduces the combined bandwidth to 2GB then the MB limit is wrong since it
>>>>> is actually capped by the GMB limit:
>>>>>
>>>>> MB:0=10GB;1=10GB <==== Does not reflect the possible per-domain memory bandwidth, which is now capped by GMB
>>>>> GMB:0=2GB;1=2GB
>>>>>
>>>>> Would something like below not be more accurate that reflects that the maximum average bandwidth
>>>>> each domain could achieve is 2GB?
>>>>>
>>>>> MB:0=2GB;1=2GB <==== Reflects accurate possible per-domain memory bandwidth
>>>>> GMB:0=2GB;1=2GB
>>>> That is reasonable. Will check how we can accommodate that.
>>> Right, this is not about the values in the L3BE registers but instead how those values
>>> are impacted by GLBE registers and how to most accurately present the resulting system
>>> configuration to user space. Thank you for considering.
>>
>> I responded too quickly earlier; an internal discussion surfaced several concerns with this approach.
>>
>> schemata represents what user space explicitly configured and what the hardware registers contain, not a derived “effective” value that depends on runtime conditions.
>> Combining configured limits (MB/GMB) with effective bandwidth, which is inherently workload-dependent, blurs semantics, breaks existing assumptions, and makes debugging more difficult.
>>
>> MB and GMB use different units and encodings, so auto-deriving values can introduce rounding issues and loss of precision.
>>
>> I’ll revisit this and come back with a refined proposal.
> Are we still talking about below copied from https://lore.kernel.org/lkml/f0f2e3eb-0fdb-4498-9eb8-73111b1c5a84@amd.com/ ?
>
> The MBA ceiling is applied at the QoS domain level.
> The GLBE ceiling is applied at the GLBE control domain level.
> If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
Yes. That is correct.
The main challenge is debugging customer issues. AMD systems often
support multiple domains - sometimes as many as 16.
If we replace the MB values with the effective MB values across all
domains, we lose visibility into the actual MB settings programmed in
hardware. In most cases, we only have access to the schemata values, and
we cannot ask customers to run rdmsr to retrieve the real register
values. This makes it difficult to diagnose complex issues.
That is why we are recommending that the MB values remain unchanged.
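For reference, the capping relationship being debated can be sketched in a few lines. This is illustrative only: `effective_mb` is a hypothetical helper, and a common unit for MB and GMB is assumed even though the real hardware encodings differ.

```python
def effective_mb(mb, gmb):
    """Effective per-domain MB limit: the configured MB value capped
    by the GMB ceiling for that domain (illustrative model only)."""
    return {dom: min(mb[dom], gmb[dom]) for dom in mb}

# Configured values mirroring the schemata example earlier in the thread.
mb  = {0: 2048, 1: 2048, 2: 2048, 3: 2048}
gmb = {0: 8,    1: 2048, 2: 8,    3: 2048}

# Domains 0 and 2 are capped at 8 by the GMB ceiling, even though the
# programmed MB registers still hold 2048.
assert effective_mb(mb, gmb) == {0: 8, 1: 2048, 2: 8, 3: 2048}
```

The debate above is whether schemata should show the configured values (the registers, useful for debugging) or this derived minimum (what a workload can actually achieve).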
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-13 17:02 ` Luck, Tony
@ 2026-02-16 19:24 ` Babu Moger
0 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-02-16 19:24 UTC (permalink / raw)
To: Luck, Tony, Moger, Babu
Cc: Reinette Chatre, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Tony,
On 2/13/26 11:02, Luck, Tony wrote:
> On Fri, Feb 13, 2026 at 10:37:48AM -0600, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>
>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>> Babu,
>>>>>>
>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>
>>>>>> Some useful additions to your explanation.
>>>>>>
>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>> Yes. Correct.
>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>> There can be only one PLZA configuration in a system. The values in the
>> MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID, CLOSID_EN) must be
>> identical across all logical processors. The only field that may differ is
>> PLZA_EN.
>>
>> I was initially unsure which RMID should be used when PLZA is enabled on MON
>> groups.
>>
>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>
>> 1. Only one group in the system can have PLZA enabled.
>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA on MON
>> group.
>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and RMID of the
>> CTRL_MON group can be written.
>> 4. If PLZA is enabled on a MON group, then the CLOSID of the CTRL_MON group
>> can be used, while the RMID of the MON group can be written.
>>
>> I am thinking this approach should work.
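The CLOSID/RMID selection rules quoted above can be modeled as a small sketch. The names (`Group`, `plza_assoc`) are hypothetical illustrations of the proposal, not kernel identifiers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Group:
    closid: int
    rmid: int
    parent: Optional["Group"] = None  # set for MON groups, None for CTRL_MON

def plza_assoc(group: Group) -> tuple:
    """Return the (CLOSID, RMID) pair that would be written for the
    single PLZA-enabled group (sketch of rules 3 and 4 above)."""
    if group.parent is None:
        # CTRL_MON group: its own CLOSID and RMID are written.
        return group.closid, group.rmid
    # MON group: the parent CTRL_MON group's CLOSID is used,
    # while the MON group's own RMID is written.
    return group.parent.closid, group.rmid

ctrl = Group(closid=3, rmid=10)
mon = Group(closid=3, rmid=42, parent=ctrl)
assert plza_assoc(ctrl) == (3, 10)
assert plza_assoc(mon) == (3, 42)
```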
> I can see why a user might want to accumulate all kernel resource usage
> in one RMID, separately from application resource usage. But wanting to
> subdivide that between different tasks seems a stretch.
>
> Remember that there are 3 main reasons why the kernel may be entered
> while an application is running:
>
> 1) Application makes a system call
> 2) A trap or fault (most common = pagefault?)
> 3) An interrupt
>
> The application has some limited control over 1 & 2. None at
> all over 3.
>
> So I'd like to hear some real use cases before resctrl commits
> to adding this complexity.
>
Imagine you have a strongly throttled thread going into the kernel and
grabbing a global lock. At the same time, an unthrottled high-priority
thread enters the kernel on another CPU and tries to acquire the same
lock. Because the lock holder is throttled, it runs slowly inside the
critical section protected by the lock. The high-priority thread is now
slowed down as well, causing a priority inversion. We have seen this
happen in certain workloads.
The only way to avoid this problem is to ensure that any thread entering
the kernel operates with the same throttling level. The only viable
option is unlimited bandwidth in the kernel for all threads. This means
the kernel needs to either run with a different CLOSID, or the CLOSID
setting needs to change on kernel entry and exit at each entry point
(syscall, trap, fault, ...). We tried manually changing the CLOSID
during kernel entry and exit and found it very expensive.
The only sensible way of doing this is via hardware support, and that is
what PLZA enables.
Thanks
Babu
>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>> number of use cases that can be supported. Consider, for example, an existing
>>> "high priority" resource group and a "low priority" resource group. The user may
>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>> cache may need more care, but if, for example, user is only interested in memory
>>> bandwidth allocation this seems a reasonable use case?
>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>> capable of in terms of number of different control groups/CLOSID that can be
>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>> example, create a resource group that contains tasks of interest and create
>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>> This will give user space better insight into system behavior and from what I can
>>> tell is supported by the feature but not enabled?
>>
>> Yes, as long as PLZA is enabled on only one group in the entire system
>>
>>>>>> 2) It can't be the root/default group
>>>>> This is something I added to keep the default group in a un-disturbed,
>>> Why was this needed?
>>>
>> With the new approach mentioned above, we can enable it in the default group also.
>>
>>>>>> 3) It can't have sub monitor groups
>>> Why not?
>> Ditto. With the new approach mentioned above, we can enable it in the
>> default group also.
>>
>>>>>> 4) It can't be pseudo-locked
>>>>> Yes.
>>>>>
>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>> need to change.
>>>>> Yes. That can be one use case.
>>>>>
>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>> do:
>>>>>>
>>>>>> # echo '*' > tasks
>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>> complications since this designation makes resource group behave differently and
>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>>>
>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>> resource group to manage user space and kernel space allocations while also supporting
>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>> use case where user space can create a new resource group with certain allocations but the
>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>> the resource group's allocations when in CPL0.
>> Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
>>
>> We need to make sure that only one group can be configured in the system
>> and not allow it in other groups when it is already enabled.
>>
>> Thanks
>> Babu
>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>>
> -Tony
>
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-14 0:10 ` Reinette Chatre
2026-02-16 15:41 ` Ben Horgan
@ 2026-02-16 22:36 ` Moger, Babu
1 sibling, 0 replies; 114+ messages in thread
From: Moger, Babu @ 2026-02-16 22:36 UTC (permalink / raw)
To: Reinette Chatre, Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Reinette,
On 2/13/2026 6:10 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/13/26 8:37 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>
>>>>
>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>> Babu,
>>>>>>
>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>
>>>>>> Some useful additions to your explanation.
>>>>>>
>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>
>>>>> Yes. Correct.
>>>
>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>>
>> There can be only one PLZA configuration in a system. The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID, CLOSID_EN) must be identical across all logical processors. The only field that may differ is PLZA_EN.
>
> ah - this is a significant part that I missed. Since this is a per-CPU register it seems
> to have the ability for expanded use in the future where different CLOSID and RMID may be
> written to it? Is PLZA leaving room for such future enhancement or does the spec contain
> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN,
> CLOSID, CLOSID_EN) must be identical across all logical processors."? That is, "forever
> and always"?
>
It should be identical across all the logical processors. I don't know
about future generations; it's better to keep that option open.
> If I understand correctly MPAM could have different PARTID and PMG for kernel use so we
> need to consider these different architectural behaviors.
>
oh ok.
>> I was initially unsure which RMID should be used when PLZA is enabled on MON groups.
>>
>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>
>> 1. Only one group in the system can have PLZA enabled.
>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA on MON group.
>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and RMID of the CTRL_MON group can be written.
>> 4. If PLZA is enabled on a MON group, then the CLOSID of the CTRL_MON group can be used, while the RMID of the MON group can be written.
>>
>> I am thinking this approach should work.
>>
>>>
>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>> number of use cases that can be supported. Consider, for example, an existing
>>> "high priority" resource group and a "low priority" resource group. The user may
>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>> cache may need more care, but if, for example, user is only interested in memory
>>> bandwidth allocation this seems a reasonable use case?
>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>> capable of in terms of number of different control groups/CLOSID that can be
>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>> example, create a resource group that contains tasks of interest and create
>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>> This will give user space better insight into system behavior and from what I can
>>> tell is supported by the feature but not enabled?
>>
>>
>> Yes, as long as PLZA is enabled on only one group in the entire system
>>
>>>
>>>>>
>>>>>> 2) It can't be the root/default group
>>>>>
>>>>> This is something I added to keep the default group in a un-disturbed,
>>>
>>> Why was this needed?
>>>
>>
>> With the new approach mentioned about we can enable in default group also.
>>
>>>>>
>>>>>> 3) It can't have sub monitor groups
>>>
>>> Why not?
>>
>> Ditto. With the new approach mentioned about we can enable in default group also.
>>
>>>
>>>>>> 4) It can't be pseudo-locked
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>> need to change.
>>>>>
>>>>> Yes. That can be one use case.
>>>>>
>>>>>>
>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>> do:
>>>>>>
>>>>>> # echo '*' > tasks
>>>
>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>> complications since this designation makes resource group behave differently and
>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>>>
>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>> resource group to manage user space and kernel space allocations while also supporting
>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>> use case where user space can create a new resource group with certain allocations but the
>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>> the resource group's allocations when in CPL0.
>>
>> Yes. We should be able do that. We need both tasks_cpl0 and cpus_cpl0.
>>
>> We need make sure only one group can configured in the system and not allow in other groups when it is already enabled.
>
> As I understand this means that only one group can have content in its
> tasks_cpl0/tasks_kernel file. There should not be any special handling for
> the remaining files of the resource group since the resource group is not
> dedicated to kernel work and can be used as a user space resource group also.
> If user space wants to create a dedicated kernel resource group there can be
> a new resource group with an empty tasks file.
Correct.
>
> hmmm ... but if user space writes a task ID to a tasks_cpl0/tasks_kernel file then
> resctrl would need to create new syntax to remove that task ID.
I'm not sure I fully understand this, so let me restate the sequence:
Example 1: Regular group
# mkdir /sys/fs/resctrl/test1
This creates a normal resctrl group.
# echo 1 > /sys/fs/resctrl/test1/tasks
The group is still a normal group at this point.
# echo 1 > /sys/fs/resctrl/test1/tasks_cpl0
This converts the group into a PLZA group (the task’s tsk->plza field
becomes 1). This still works as a regular group because the user and
kernel CLOSIDs are the same.
Now create another group:
# mkdir /sys/fs/resctrl/test2
# echo 2 > /sys/fs/resctrl/test2/tasks_cpl0
On AMD systems, this should fail with an error like: “Group test1 is
already configured as a PLZA group”, because only one PLZA group is allowed.
Now remove task 1 from the test1 group:
# echo "" > /sys/fs/resctrl/test1/tasks_cpl0
This resets the group back to a regular group.
Now try again to make test2 a PLZA group:
# echo 2 > /sys/fs/resctrl/test2/tasks_cpl0
This should now succeed.
It makes more sense to have a dedicated group for PLZA in the case of
AMD, but this option also allows having a mix of both in one group.
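A minimal model of the single-PLZA-group rule from the sequence above (hypothetical names; a sketch of the proposed semantics, not the resctrl implementation):

```python
class ResctrlModel:
    """Toy model of per-group tasks_cpl0 writes with at most one
    PLZA group allowed system-wide."""

    def __init__(self):
        self.plza_group = None  # name of the current PLZA group, if any

    def write_tasks_cpl0(self, group, pids):
        if pids:
            # Enabling PLZA: rejected if another group already has it.
            if self.plza_group not in (None, group):
                raise OSError(f"Group {self.plza_group} is already "
                              "configured as a PLZA group")
            self.plza_group = group
        elif self.plza_group == group:
            # Empty write resets the group back to a regular group.
            self.plza_group = None

fs = ResctrlModel()
fs.write_tasks_cpl0("test1", [1])      # test1 becomes the PLZA group
try:
    fs.write_tasks_cpl0("test2", [2])  # rejected: test1 already PLZA
except OSError:
    pass
fs.write_tasks_cpl0("test1", [])       # reset test1 to a regular group
fs.write_tasks_cpl0("test2", [2])      # now succeeds
assert fs.plza_group == "test2"
```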
Thanks
Babu
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-16 15:41 ` Ben Horgan
@ 2026-02-16 22:52 ` Moger, Babu
2026-02-17 15:56 ` Ben Horgan
0 siblings, 1 reply; 114+ messages in thread
From: Moger, Babu @ 2026-02-16 22:52 UTC (permalink / raw)
To: Ben Horgan, Reinette Chatre, Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Ben,
On 2/16/2026 9:41 AM, Ben Horgan wrote:
> Hi Babu, Reinette,
>
> On 2/14/26 00:10, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 2/13/26 8:37 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>
>>>>>
>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>> Babu,
>>>>>>>
>>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>>
>>>>>>> Some useful additions to your explanation.
>>>>>>>
>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>
>>>>>> Yes. Correct.
>>>>
>>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>>>
>>> There can be only one PLZA configuration in a system. The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID, CLOSID_EN) must be identical across all logical processors. The only field that may differ is PLZA_EN.
>
> Does this have any effect on hypervisors?
Because the hypervisor runs at CPL0, there could be a use case. I have
not completely understood that part yet.
>
>>
>> ah - this is a significant part that I missed. Since this is a per-CPU register it seems
>
> I also missed that.
>
>> to have the ability for expanded use in the future where different CLOSID and RMID may be
>> written to it? Is PLZA leaving room for such future enhancement or does the spec contain
>> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN,
>> CLOSID, CLOSID_EN) must be identical across all logical processors."? That is, "forever
>> and always"?
>>
>> If I understand correctly MPAM could have different PARTID and PMG for kernel use so we
>> need to consider these different architectural behaviors.
>
> Yes, MPAM has a per-cpu register MPAM1_EL1.
>
oh ok.
>>
>>> I was initially unsure which RMID should be used when PLZA is enabled on MON groups.
>>>
>>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>>
>>> 1. Only one group in the system can have PLZA enabled.
>>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA on MON group.
>>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and RMID of the CTRL_MON group can be written.
>>> 4. If PLZA is enabled on a MON group, then the CLOSID of the CTRL_MON group can be used, while the RMID of the MON group can be written.
>
> Given that CLOSID and RMID are fixed once in the PLZA configuration
> could this be simplified by just assuming they have the values of the
>> default group, CLOSID=0 and RMID=0, and let the user base their
>> configuration on that?
>
I didn't understand this question. There are 16 CLOSIDs and 1024 RMIDs.
We can use any one of these to enable PLZA. It is not fixed in that sense.
>>>
>>> I am thinking this approach should work.
>>>
>>>>
>>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>>> number of use cases that can be supported. Consider, for example, an existing
>>>> "high priority" resource group and a "low priority" resource group. The user may
>>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>>> cache may need more care, but if, for example, user is only interested in memory
>>>> bandwidth allocation this seems a reasonable use case?
>>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>>> capable of in terms of number of different control groups/CLOSID that can be
>>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>>> example, create a resource group that contains tasks of interest and create
>>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>>> This will give user space better insight into system behavior and from what I can
>>>> tell is supported by the feature but not enabled?
>>>
>>>
>>> Yes, as long as PLZA is enabled on only one group in the entire system
>>>
>>>>
>>>>>>
>>>>>>> 2) It can't be the root/default group
>>>>>>
>>>>>> This is something I added to keep the default group in a un-disturbed,
>>>>
>>>> Why was this needed?
>>>>
>>>
>>> With the new approach mentioned about we can enable in default group also.
>>>
>>>>>>
>>>>>>> 3) It can't have sub monitor groups
>>>>
>>>> Why not?
>>>
>>> Ditto. With the new approach mentioned about we can enable in default group also.
>>>
>>>>
>>>>>>> 4) It can't be pseudo-locked
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>>> need to change.
>>>>>>
>>>>>> Yes. That can be one use case.
>>>>>>
>>>>>>>
>>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>>> do:
>>>>>>>
>>>>>>> # echo '*' > tasks
>>>>
>>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>>> complications since this designation makes resource group behave differently and
>>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>>>>
>>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>>> resource group to manage user space and kernel space allocations while also supporting
>>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>>> use case where user space can create a new resource group with certain allocations but the
>>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>>> the resource group's allocations when in CPL0.
>>>
>>> Yes. We should be able do that. We need both tasks_cpl0 and cpus_cpl0.
>>>
>>> We need make sure only one group can configured in the system and not allow in other groups when it is already enabled.
>>
>> As I understand this means that only one group can have content in its
>> tasks_cpl0/tasks_kernel file. There should not be any special handling for
>> the remaining files of the resource group since the resource group is not
>> dedicated to kernel work and can be used as a user space resource group also.
>> If user space wants to create a dedicated kernel resource group there can be
>> a new resource group with an empty tasks file.
>>
>> hmmm ... but if user space writes a task ID to a tasks_cpl0/tasks_kernel file then
>> resctrl would need to create new syntax to remove that task ID.
>>
>> Possibly MPAM can build on this by allowing user space to write to multiple
>> tasks_cpl0/tasks_kernel files? (and the next version of PLZA may too)
>>
>> Reinette
>>
>>
>>>
>>> Thanks
>>> Babu
>>>
>>>>
>>>> Reinette
>>>>
>>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>>>
>>>
>>
>>
>
> Thanks,
>
> Ben
>
>
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-16 22:52 ` Moger, Babu
@ 2026-02-17 15:56 ` Ben Horgan
2026-02-17 16:38 ` Babu Moger
2026-02-18 6:22 ` Stephane Eranian
0 siblings, 2 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-17 15:56 UTC (permalink / raw)
To: Moger, Babu, Reinette Chatre, Moger, Babu, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Babu,
On 2/16/26 22:52, Moger, Babu wrote:
> Hi Ben,
>
> On 2/16/2026 9:41 AM, Ben Horgan wrote:
>> Hi Babu, Reinette,
>>
>> On 2/14/26 00:10, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 2/13/26 8:37 AM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>>
>>>>>>
>>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>>> Babu,
>>>>>>>>
>>>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>>>
>>>>>>>> Some useful additions to your explanation.
>>>>>>>>
>>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>>
>>>>>>> Yes. Correct.
>>>>>
>>>>> Why limit it to one CTRL_MON group and why not support it for MON
>>>>> groups?
>>>>
>>>> There can be only one PLZA configuration in a system. The values in
>>>> the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID,
>>>> CLOSID_EN) must be identical across all logical processors. The only
>>>> field that may differ is PLZA_EN.
>>
>> Does this have any effect on hypervisors?
>
> Because the hypervisor runs at CPL0, there could be some use case. I have
> not completely understood that part.
>
>>
>>>
>>> ah - this is a significant part that I missed. Since this is a per-
>>> CPU register it seems
>>
>> I also missed that.
>>
>>> to have the ability for expanded use in the future where different
>>> CLOSID and RMID may be
>>> written to it? Is PLZA leaving room for such future enhancement or
>>> does the spec contain
>>> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC
>>> register (RMID, RMID_EN,
>>> CLOSID, CLOSID_EN) must be identical across all logical processors."?
>>> That is, "forever
>>> and always"?
>>>
>>> If I understand correctly MPAM could have different PARTID and PMG
>>> for kernel use so we
>>> need to consider these different architectural behaviors.
>>
>> Yes, MPAM has a per-cpu register MPAM1_EL1.
>>
>
> oh ok.
>
>>>
>>>> I was initially unsure which RMID should be used when PLZA is
>>>> enabled on MON groups.
>>>>
>>>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>>>
>>>> 1. Only one group in the system can have PLZA enabled.
>>>> 2. If PLZA is enabled on a CTRL_MON group, then we cannot enable PLZA
>>>> on a MON group.
>>>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and
>>>> RMID of the CTRL_MON group can be written.
>>>> 4. If PLZA is enabled on a MON group, then the CLOSID of the
>>>> CTRL_MON group can be used, while the RMID of the MON group can be
>>>> written.
>>
>> Given that CLOSID and RMID are fixed once in the PLZA configuration
>> could this be simplified by just assuming they have the values of the
>> default group, CLOSID=0 and RMID=0 and let the user base their
>> configuration on that?
>>
>
> I didn't understand this question. There are 16 CLOSIDs and 1024 RMIDs.
> We can use any one of these to enable PLZA. It is not fixed in that sense.
Sorry, I wasn't clear. What I'm trying to understand is what you gain by
this flexibility. Given that the CLOSID and RMID values are just
identifiers within the hardware, and have only the meaning they are given
by the grouping and controls/monitors set up by resctrl (or any other
software interface), would you lose anything by just saying the PLZA
group has CLOSID=0 and RMID=0? Is there value in changing the PLZA
CLOSID and RMID, or can the same effect be achieved by just changing the
resctrl configuration?
I was also wondering if using the default group this way would mean that
you wouldn't need to reserve the group for only kernel use.
>
>
>>>>
>>>> I am thinking this approach should work.
>>>>
>>>>>
>>>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This
>>>>> reduces the
>>>>> number of use cases that can be supported. Consider, for
>>>>> example, an existing
>>>>> "high priority" resource group and a "low priority" resource
>>>>> group. The user may
>>>>> just want to let the tasks in the "low priority" resource
>>>>> group run as "high priority"
>>>>> when in CPL0. This of course may depend on what resources are
>>>>> allocated, for example
>>>>> cache may need more care, but if, for example, user is only
>>>>> interested in memory
>>>>> bandwidth allocation this seems a reasonable use case?
>>>>> 2) Similar to what Tony [1] mentioned this does not enable what the
>>>>> hardware is
>>>>> capable of in terms of number of different control groups/
>>>>> CLOSID that can be
>>>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one
>>>>> CLOSID?
>>>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC
>>>>> similar to
>>>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user
>>>>> space to, for
>>>>> example, create a resource group that contains tasks of
>>>>> interest and create
>>>>> a monitor group within it that monitors all tasks' bandwidth
>>>>> usage when in CPL0.
>>>>> This will give user space better insight into system behavior
>>>>> and from what I can
>>>>> tell is supported by the feature but not enabled?
>>>>
>>>>
>>>> Yes, as long as PLZA is enabled on only one group in the entire system
>>>>
>>>>>
>>>>>>>
>>>>>>>> 2) It can't be the root/default group
>>>>>>>
>>>>>>> This is something I added to keep the default group in an
>>>>>>> undisturbed,
>>>>>
>>>>> Why was this needed?
>>>>>
>>>>
>>>> With the new approach mentioned above, we can enable it in the default
>>>> group also.
>>>>
>>>>>>>
>>>>>>>> 3) It can't have sub monitor groups
>>>>>
>>>>> Why not?
>>>>
>>>> Ditto. With the new approach mentioned above, we can enable it in the
>>>> default group also.
>>>>
>>>>>
>>>>>>>> 4) It can't be pseudo-locked
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> Would a potential use case involve putting *all* tasks into the
>>>>>>>> PLZA group? That
>>>>>>>> would avoid any additional context switch overhead as the PLZA
>>>>>>>> MSR would never
>>>>>>>> need to change.
>>>>>>>
>>>>>>> Yes. That can be one use case.
>>>>>>>
>>>>>>>>
>>>>>>>> If that is the case, maybe for the PLZA group we should allow
>>>>>>>> user to
>>>>>>>> do:
>>>>>>>>
>>>>>>>> # echo '*' > tasks
>>>>>
>>>>> Dedicating a resource group to "PLZA" seems restrictive while also
>>>>> adding many
>>>>> complications since this designation makes resource group behave
>>>>> differently and
>>>>> thus the files need to get extra "treatments" to handle this "PLZA"
>>>>> designation.
>>>>>
>>>>> I am wondering if it will not be simpler to introduce just one new
>>>>> file, for example
>>>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space
>>>>> writes a task ID to the
>>>>> file it "enables" PLZA for this task and that group's CLOSID and
>>>>> RMID is the associated
>>>>> task's "PLZA" CLOSID and RMID. This gives user space the
>>>>> flexibility to use the same
>>>>> resource group to manage user space and kernel space allocations
>>>>> while also supporting
>>>>> various monitoring use cases. This still supports the "dedicate a
>>>>> resource group to PLZA"
>>>>> use case where user space can create a new resource group with
>>>>> certain allocations but the
>>>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks
>>>>> needing to run with
>>>>> the resource group's allocations when in CPL0.
>>>>
>>>> Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
>>>>
>>>> We need to make sure only one group can be configured in the system and
>>>> not allow it in other groups when it is already enabled.
>>>
>>> As I understand this means that only one group can have content in its
>>> tasks_cpl0/tasks_kernel file. There should not be any special
>>> handling for
>>> the remaining files of the resource group since the resource group is
>>> not
>>> dedicated to kernel work and can be used as a user space resource
>>> group also.
>>> If user space wants to create a dedicated kernel resource group there
>>> can be
>>> a new resource group with an empty tasks file.
>>>
>>> hmmm ... but if user space writes a task ID to a tasks_cpl0/
>>> tasks_kernel file then
>>> resctrl would need to create new syntax to remove that task ID.
>>>
>>> Possibly MPAM can build on this by allowing user space to write to
>>> multiple
>>> tasks_cpl0/tasks_kernel files? (and the next version of PLZA may too)
>>>
>>> Reinette
>>>
>>>
>>>>
>>>> Thanks
>>>> Babu
>>>>
>>>>>
>>>>> Reinette
>>>>>
>>>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>>>>
>>>>
>>>
>>>
>>
>> Thanks,
>>
>> Ben
>>
>>
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 15:56 ` Ben Horgan
@ 2026-02-17 16:38 ` Babu Moger
2026-02-18 9:54 ` Ben Horgan
2026-02-18 6:22 ` Stephane Eranian
1 sibling, 1 reply; 114+ messages in thread
From: Babu Moger @ 2026-02-17 16:38 UTC (permalink / raw)
To: Ben Horgan, Moger, Babu, Reinette Chatre, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Ben,
On 2/17/26 09:56, Ben Horgan wrote:
> Hi Babu,
>
> On 2/16/26 22:52, Moger, Babu wrote:
>> Hi Ben,
>>
>> On 2/16/2026 9:41 AM, Ben Horgan wrote:
>>> Hi Babu, Reinette,
>>>
>>> On 2/14/26 00:10, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 2/13/26 8:37 AM, Moger, Babu wrote:
>>>>> Hi Reinette,
>>>>>
>>>>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>>>>> Hi Babu,
>>>>>>
>>>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>>>
>>>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>>>> Babu,
>>>>>>>>>
>>>>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>>>>
>>>>>>>>> Some useful additions to your explanation.
>>>>>>>>>
>>>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>>> Yes. Correct.
>>>>>> Why limit it to one CTRL_MON group and why not support it for MON
>>>>>> groups?
>>>>> There can be only one PLZA configuration in a system. The values in
>>>>> the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID,
>>>>> CLOSID_EN) must be identical across all logical processors. The only
>>>>> field that may differ is PLZA_EN.
>>> Does this have any effect on hypervisors?
>> Because the hypervisor runs at CPL0, there could be some use case. I have
>> not completely understood that part.
>>
>>>> ah - this is a significant part that I missed. Since this is a per-
>>>> CPU register it seems
>>> I also missed that.
>>>
>>>> to have the ability for expanded use in the future where different
>>>> CLOSID and RMID may be
>>>> written to it? Is PLZA leaving room for such future enhancement or
>>>> does the spec contain
>>>> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC
>>>> register (RMID, RMID_EN,
>>>> CLOSID, CLOSID_EN) must be identical across all logical processors."?
>>>> That is, "forever
>>>> and always"?
>>>>
>>>> If I understand correctly MPAM could have different PARTID and PMG
>>>> for kernel use so we
>>>> need to consider these different architectural behaviors.
>>> Yes, MPAM has a per-cpu register MPAM1_EL1.
>>>
>> oh ok.
>>
>>>>> I was initially unsure which RMID should be used when PLZA is
>>>>> enabled on MON groups.
>>>>>
>>>>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>>>>
>>>>> 1. Only one group in the system can have PLZA enabled.
>>>>> 2. If PLZA is enabled on a CTRL_MON group, then we cannot enable PLZA
>>>>> on a MON group.
>>>>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and
>>>>> RMID of the CTRL_MON group can be written.
>>>>> 4. If PLZA is enabled on a MON group, then the CLOSID of the
>>>>> CTRL_MON group can be used, while the RMID of the MON group can be
>>>>> written.
>>> Given that CLOSID and RMID are fixed once in the PLZA configuration
>>> could this be simplified by just assuming they have the values of the
>>> default group, CLOSID=0 and RMID=0 and let the user base their
>>> configuration on that?
>>>
>> I didn't understand this question. There are 16 CLOSIDs and 1024 RMIDs.
>> We can use any one of these to enable PLZA. It is not fixed in that sense.
> Sorry, I wasn't clear. What I'm trying to understand is what you gain by
> this flexibility. Given that the CLOSID and RMID values are just
> identifiers within the hardware, and have only the meaning they are given
> by the grouping and controls/monitors set up by resctrl (or any other
> software interface), would you lose anything by just saying the PLZA
> group has CLOSID=0 and RMID=0? Is there value in changing the PLZA
> CLOSID and RMID, or can the same effect be achieved by just changing the
> resctrl configuration?
>
> I was also wondering if using the default group this way would mean that
> you wouldn't need to reserve the group for only kernel use.
Yes, that is an option, but it becomes too restrictive. Would this
approach work for the ARM implementation?
If a user wants to keep a select set of tasks running at different
allocation levels, they would need to create another new group, move all
tasks from the default group to the new group, and leave only the selected
tasks in the default group.
And if that group is later deleted, all tasks will automatically return
to the default group.
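The flow above can be sketched with plain filesystem commands. This is only a mock of the resctrl layout (a real run needs root and a mounted /sys/fs/resctrl, where the kernel moves a task automatically when its PID is written to a group's tasks file); the group name and PIDs are made up:

```shell
# Mock of /sys/fs/resctrl in a temp dir; on real hardware the kernel
# maintains these files and removes a PID from its old group on write.
RESCTRL=$(mktemp -d)
printf '101\n102\n103\n' > "$RESCTRL/tasks"   # default group holds all tasks

# Create a new CTRL_MON group and move the non-selected tasks into it.
mkdir "$RESCTRL/other_group"
: > "$RESCTRL/other_group/tasks"
for pid in 101 102; do
    echo "$pid" >> "$RESCTRL/other_group/tasks"
done

# In the mock, drop the moved PIDs from the default group by hand;
# real resctrl does this automatically on the writes above.
grep -v -e '^101$' -e '^102$' "$RESCTRL/tasks" > "$RESCTRL/tasks.new"
mv "$RESCTRL/tasks.new" "$RESCTRL/tasks"

cat "$RESCTRL/tasks"   # only the selected task remains in the default group
```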
Thanks,
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-16 15:18 ` Ben Horgan
@ 2026-02-17 18:51 ` Reinette Chatre
2026-02-17 21:44 ` Luck, Tony
2026-02-19 10:21 ` Ben Horgan
0 siblings, 2 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-17 18:51 UTC (permalink / raw)
To: Ben Horgan
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Ben,
On 2/16/26 7:18 AM, Ben Horgan wrote:
> On Thu, Feb 12, 2026 at 10:37:21AM -0800, Reinette Chatre wrote:
>> On 2/12/26 5:55 AM, Ben Horgan wrote:
>>> On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
>>>> On 2/11/26 8:40 AM, Ben Horgan wrote:
>>>>> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>>>>>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>>>>>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>>>>>> instead of CPL0 using something like "kernel" or ... ?
>>>>>
>>>>> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
>>>>> internally and here are a few thoughts.
>>>>>
>>>>> If the use case is just an option to run all tasks with the same closid/rmid
>>>>> (partid/pmg) configuration when they are running in the kernel, then I'd favour a
>>>>> mount option. The resctrl filesystem interface doesn't need to change and
>>>>
>>>> I view mount options as an interface of last resort. Why would a mount option be needed
>>>> in this case? The existence of the file used to configure the feature seems sufficient?
>>>
>>> If we are taking away a closid from the user then the number of CTRL_MON groups
>>> that can be created changes. It seems reasonable for user-space to expect
>>> num_closid to be a fixed value.
>>
>> I do not see why we need to take away a CLOSID from the user. Consider a user space that
>
> Yes, it is just slightly simpler to take away a CLOSID, but we could just go with
> the default CLOSID also being used for the kernel. I would be ok with a file saying
> the mode, like the mbm_event file does for counter assignment. It is slightly
> misleading that a configuration file is under info, but necessary as we don't have
> another location global to the resctrl mount.
Indeed, the "info" directory has evolved more into a "config" directory.
>> runs with just two resource groups, for example, "high priority" and "low priority", it seems
>> reasonable to make it possible to let the "low priority" tasks run with "high priority"
>> allocations when in kernel space without needing to dedicate a new CLOSID? More reasonable
>> when only considering memory bandwidth allocation though.
>>
>>>
>>>>
>>>> Also ...
>>>>
>>>> I do not think resctrl should unnecessarily place constraints on what the hardware
>>>> features are capable of. As I understand, both PLZA and MPAM support use cases where
>>>> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
>>>> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
>>>> This may be because I am not familiar with all the requirements here so please do
>>>> help with insight on how the hardware feature is intended to be used as it relates
>>>> to its design.
>>>>
>>>> We have to be very careful when constraining a feature this much. If resctrl does something
>>>> like this it essentially restricts what users could do forever.
>>>
>>> Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
>>> fixed kernel CLOSID/RMID configuration option might just give all we need for
>>> usecases we know we have and be minimally intrusive enough to not preclude a
>>> more featureful PLZA later when new usecases come about.
>>
>> Having ability to grow features would be ideal. I do not see how a fixed kernel CLOSID/RMID
>> configuration leaves room to build on top though. Could you please elaborate?
>
> If we initially go with a single new configuration file, e.g. kernel_mode, which
> could be "match_user" or "use_root", this would be the only initial change to the
> interface needed. If more use cases present themselves, a new mode could be added,
> e.g. "configurable", and an interface to actually change the rmid/closid for the
> kernel could be added.
Something like this could be a base to work from. I think only the two ("match_user" and
"use_root") are a bit limiting for even the initial implementation though.
As I understand, "use_root" implies using the allocations of the default group but
does not indicate what MON group (which RMID/PMG) should be used to monitor the
work done in kernel space. A way to specify the actual group may be needed?
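As a rough sketch of what parsing such a kernel_mode file could look like (the file name, the mode strings, and the enum below are just the ideas floated in this thread, not an existing resctrl interface):

```c
#include <string.h>

/*
 * Hypothetical modes for the proposed "kernel_mode" file; names are
 * assumptions taken from the discussion, not merged kernel code.
 */
enum kernel_mode {
	KMODE_INVALID = -1,
	KMODE_MATCH_USER,   /* kernel work uses the task's own CLOSID/RMID */
	KMODE_USE_ROOT,     /* kernel work uses the default group's CLOSID/RMID */
	KMODE_CONFIGURABLE, /* possible future mode with per-group choice */
};

/* Map a user-written mode string to the enum; unknown strings are rejected. */
static enum kernel_mode parse_kernel_mode(const char *buf)
{
	if (!strcmp(buf, "match_user"))
		return KMODE_MATCH_USER;
	if (!strcmp(buf, "use_root"))
		return KMODE_USE_ROOT;
	if (!strcmp(buf, "configurable"))
		return KMODE_CONFIGURABLE;
	return KMODE_INVALID;
}
```

Rejecting unknown strings up front would leave room to add modes later without ambiguity.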
>> I wonder if the benefit of the fixed CLOSID/RMID is perhaps mostly in the cost of
>> context switching which I do not think is a concern for MPAM but it may be for PLZA?
>>
>> One option to support fixed kernel CLOSID/RMID at the beginning and leave room to build
>> may be to create the kernel_group or "tasks_kernel" interface as a baseline but in first
>> implementation only allow user space to write the same group to all "kernel_group" files or
>> to only allow to write to one of the "tasks_kernel" files in the resctrl fs hierarchy. At
>> that time the associated CLOSID/RMID would become the "fixed configuration" and attempts to
>> write to others can return "ENOSPC"?
>
> I think we'd have to be sure of the final interface if we go this way.
I do not think we should aim to know the final interface since that requires knowing all future
hardware features and their implementations in advance. Instead we should aim to have something
that we can build on that is accompanied by documentation that supports future flexibility (some may
refer to this as "weasel words").
>> From what I can tell this still does not require to take away a CLOSID/RMID from user space
>> though. Dedicating a CLOSID/RMID to kernel work can still be done but be in control of user
>> that can, for example leave the "tasks" and "cpus" files empty.
>>
>>> One complication with the fixed kernel CLOSID/RMID option is that for x86 you
>>> may want to be able to monitor a task's resource usage whether or not it is in
>>> the kernel or userspace and so only have a fixed CLOSID. However, for MPAM this
>>> wouldn't work as PMG (~RMID) is scoped to PARTID (~CLOSID).
>>>
>>>>
>>>>> userspace software doesn't need to change. This could either take away a
>>>>> closid/rmid from userspace and dedicate it to the kernel or perhaps have a
>>>>> policy to have the default group as the kernel group. If you use the default
>>>>
>>>> Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
>>>> between user space and kernel. I do not see a motivation for resctrl to place such
>>>> constraint.
>>>>
>>>>> configuration, at least for MPAM, the kernel may not be running at the highest
>>>>> priority as a minimum bandwidth can be used to give a priority boost. (Once we
>>>>> have a resctrl schema for this.)
>>>>>
>>>>> It could be useful to have something a bit more featureful though. Is there a
>>>>> need for the two mappings, task->cpl0 config and task->cpl1 config, to be independent or
>>>>> would a task->(cpl0 config, cpl1 config) be sufficient? It seems awkward that
>>>>> it's not a single write to move a task. If a single mapping is sufficient, then
>>>>
>>>> Moving a task in x86 is currently two writes by writing the CLOSID and RMID separately.
>>>> I think the MPAM approach is better and there may be opportunity to do this in a similar
>>>> way and both architectures use the same field(s) in the task_struct.
>>>
>>> I was referring to the userspace file write but unifying on the same fields in
>>> task_struct could be good. The single write is necessary for MPAM as PMG is
>>> scoped to PARTID and I don't think x86 behaviour changes if it moves to the same
>>> approach.
>>>
>>
>> ah - I misunderstood. You are suggesting to have one file that the user writes
>> to set both user space and kernel space CLOSID/RMID? This sounds like what the
>
> Yes, the kernel_groups idea does partially have this as once you've set the
> kernel_group for a CTRL_MON or MON group then the user space configuration
> dictates the kernel space configuration. As you pointed out, this is also
> a drawback of the kernel_groups idea.
>
>> existing "tasks" file does but only supports the same CLOSID/RMID for both user
>> space and kernel space. To support the new hardware features where the CLOSID/RMID
>> can be different we cannot just change "tasks" interface and would need to keep it
>> backward compatible. So far I assumed that it would be ok for the "tasks" file
>> to essentially get new meaning as the CLOSID/RMID for just user space work, which
>> seems to require a second file for kernel space as a consequence? So far I have
>> not seen an option that does not change meaning of the "tasks" file.
>
> Would it make sense to have some new type of entries in the tasks file,
> e.g. k_ctrl_<pid>, k_mon_<pid> to say, in the kernel, use the closid of this
> CTRL_MON for this task pid or use the rmid of this CTRL_MON/MON group for this task
> pid? We would still probably need separate files for the cpu configuration.
I am obligated to nack such a change to the tasks file since it would impact any
existing user space parsing of this file.
>
> If separate files make more sense, then we might need 2 extra tasks files to
> decouple closid and rmid, e.g. tasks_k_ctrl and task_k_mon. The task_k_mon would
> be in all CTRL_MON and MON groups and determine the rmid and tasks_k_ctrl just
> in a CTRL_MON group and determine a closid.
This is possible, yes.
>>>>> as a single new file, kernel_group, per CTRL_MON group (maybe MON groups) as
>>>>> suggested above, but rather than a task that file could hold a path to the
>>>>> CTRL_MON/MON group that provides the kernel configuration for tasks running in
>>>>> that group. So that this can be transparent to existing software an empty string
>>>>
>>>> Something like this would force all tasks of a group to run with the same CLOSID/RMID
>>>> (PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
>>>> and may reduce the possible use case of this feature.
>>>>
>>>> For example,
>>>> - There may be a scenario where there is a set of tasks with a particular allocation
>>>> when running in user space but when in kernel these tasks benefit from different
>>>> allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
>>>> user space with allocations from resource_groupA. While these tasks are ok with this
>>>> allocation when in user space they have different requirements when it comes to
>>>> kernel space. There may be a resource_groupB that allocates a lot of resources ("high
>>>> priority") that task 1 should use for kernel work and a resource_groupC that allocates
>>>> fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
>>>>
>>>> resource_groupA:
>>>> schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
>>>> tasks when in user space: 1, 2, 3
>>>>
>>>> resource_groupB:
>>>> schemata: <high priority allocations>
>>>> tasks when in kernel space: 1
>>>>
>>>> resource_groupC:
>>>> schemata: <medium priority allocations>
>>>> tasks when in kernel space: 2, 3
>>>
>>> I'm not sure if this would happen in the real world or not.
>>
>> Ack. I would like to echo Tony's request for feedback from resctrl users
>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>
> Indeed. This is all getting a bit complicated.
>
ack
Reinette
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 18:51 ` Reinette Chatre
@ 2026-02-17 21:44 ` Luck, Tony
2026-02-17 22:37 ` Reinette Chatre
2026-02-19 11:06 ` Ben Horgan
2026-02-19 10:21 ` Ben Horgan
1 sibling, 2 replies; 114+ messages in thread
From: Luck, Tony @ 2026-02-17 21:44 UTC (permalink / raw)
To: Reinette Chatre
Cc: Ben Horgan, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
> >>> I'm not sure if this would happen in the real world or not.
> >>
> >> Ack. I would like to echo Tony's request for feedback from resctrl users
> >> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
> >
> > Indeed. This is all getting a bit complicated.
> >
>
> ack
We have several proposals so far:
1) Ben's suggestion to use the default group (either with a Babu-style
"plza" file just in that group, or a configuration file under "info/").
This is easily the simplest for implementation, but has no flexibility.
Also requires users to move all the non-critical workloads out to other
CTRL_MON groups. Doesn't steal a CLOSID/RMID.
2) My thoughts are for a separate group that is only used to configure
the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
are used for all tasks when in kernel mode.
No context switch overhead. Has some flexibility.
3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
group in addition to belonging to another group that defines schemata
resources when running in non-kernel mode.
Tasks aren't required to be in the kernel group, in which case they
keep the same CLOSID in both user and kernel mode. When used in this
way there will be context switch overhead when changing between tasks
with different kernel CLOSID/RMID.
4) Even more complex scenarios with more than one user configurable
kernel group to give more options on resources available in the kernel.
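The context-switch cost difference between options 2 and 3 can be sketched as below (the struct and field names are hypothetical, not the RFC's actual code): with one global kernel CLOSID/RMID the check is never true, while per-task kernel groups need an MSR write whenever consecutive tasks disagree.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-task kernel-mode IDs; not the RFC's actual fields. */
struct fake_task {
	uint32_t kernel_closid;
	uint32_t kernel_rmid;
};

/*
 * Would switching from prev to next require rewriting the PLZA
 * association MSR?  Under option 2 all tasks share one kernel
 * CLOSID/RMID, so this is always false; under option 3 it is true
 * whenever the two tasks' kernel groups differ.
 */
static bool plza_msr_write_needed(const struct fake_task *prev,
				  const struct fake_task *next)
{
	return prev->kernel_closid != next->kernel_closid ||
	       prev->kernel_rmid != next->kernel_rmid;
}
```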
I had a quick pass at coding my option "2". My UI to designate the
group to use for kernel mode is to reserve the name "kernel_group"
when making CTRL_MON groups. Some tweaks to avoid creating the
"tasks", "cpus", and "cpus_list" files (which might be done more
elegantly), and "mon_groups" directory in this group.
I just have stubs in the arch/x86 core.c file for enumeration and
enable/disable. Just realized I'm missing a call to disable on
unmount of the resctrl file system.
Apart from umount, I think it is more or less complete, and fairly
compact:
arch/x86/kernel/cpu/resctrl/core.c | 25 +++++++++++++++++++++++++
fs/resctrl/internal.h | 9 +++++++--
fs/resctrl/rdtgroup.c | 49 ++++++++++++++++++++++++++++++++++++-------------
include/linux/resctrl.h | 4 ++++
4 files changed, 72 insertions(+), 15 deletions(-)
-Tony
---
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 006e57fd7ca5..540ab9d7621a 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -702,6 +702,10 @@ bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r);
extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
+bool resctrl_arch_kernel_group_is_supported(void);
+void resctrl_arch_kernel_group_enable(u32 closid, u32 rmid);
+void resctrl_arch_kernel_group_disable(void);
+
int resctrl_init(void);
void resctrl_exit(void);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 1a9b29119f88..99fbdcaf3c63 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -156,6 +156,7 @@ extern bool resctrl_mounted;
enum rdt_group_type {
RDTCTRL_GROUP = 0,
RDTMON_GROUP,
+ RDTKERNEL_GROUP,
RDT_NUM_GROUP,
};
@@ -245,6 +246,8 @@ struct rdtgroup {
#define RFTYPE_BASE BIT(1)
+#define RFTYPE_TASKS_CPUS BIT(2)
+
#define RFTYPE_CTRL BIT(4)
#define RFTYPE_MON BIT(5)
@@ -267,9 +270,11 @@ struct rdtgroup {
#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP)
-#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL)
+#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_TASKS_CPUS | RFTYPE_CTRL)
+
+#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_TASKS_CPUS | RFTYPE_MON)
-#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON)
+#define RFTYPE_KERNEL_BASE (RFTYPE_BASE | RFTYPE_CTRL)
/* List of all resource groups */
extern struct list_head rdt_all_groups;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7667cf7c4e94..94d20b200e47 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -733,6 +733,28 @@ static void clear_closid_rmid(int cpu)
RESCTRL_RESERVED_CLOSID);
}
+static bool kernel_group_is_enabled;
+static u32 kernel_group_closid, kernel_group_rmid;
+
+bool resctrl_arch_kernel_group_is_supported(void)
+{
+ return true;
+}
+
+void resctrl_arch_kernel_group_enable(u32 closid, u32 rmid)
+{
+ pr_info("Enable kernel group on all CPUs here closid=%u rmid=%u\n", closid, rmid);
+ kernel_group_closid = closid;
+ kernel_group_rmid = rmid;
+ kernel_group_is_enabled = true;
+}
+
+void resctrl_arch_kernel_group_disable(void)
+{
+ pr_info("Disable kernel group on all CPUs here\n");
+ kernel_group_is_enabled = false;
+}
+
static int resctrl_arch_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;
@@ -743,6 +765,9 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
mutex_unlock(&domain_list_lock);
clear_closid_rmid(cpu);
+ if (kernel_group_is_enabled)
+ pr_info("Enable kernel group on CPU:%d closid=%u rmid=%u\n",
+ cpu, kernel_group_closid, kernel_group_rmid);
resctrl_online_cpu(cpu);
return 0;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index ba8d503551cd..0d396569a76a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2046,7 +2046,7 @@ static struct rftype res_common_files[] = {
.kf_ops = &rdtgroup_kf_single_ops,
.write = rdtgroup_cpus_write,
.seq_show = rdtgroup_cpus_show,
- .fflags = RFTYPE_BASE,
+ .fflags = RFTYPE_BASE | RFTYPE_TASKS_CPUS,
},
{
.name = "cpus_list",
@@ -2055,7 +2055,7 @@ static struct rftype res_common_files[] = {
.write = rdtgroup_cpus_write,
.seq_show = rdtgroup_cpus_show,
.flags = RFTYPE_FLAGS_CPUS_LIST,
- .fflags = RFTYPE_BASE,
+ .fflags = RFTYPE_BASE | RFTYPE_TASKS_CPUS,
},
{
.name = "tasks",
@@ -2063,14 +2063,14 @@ static struct rftype res_common_files[] = {
.kf_ops = &rdtgroup_kf_single_ops,
.write = rdtgroup_tasks_write,
.seq_show = rdtgroup_tasks_show,
- .fflags = RFTYPE_BASE,
+ .fflags = RFTYPE_BASE | RFTYPE_TASKS_CPUS,
},
{
.name = "mon_hw_id",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_rmid_show,
- .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG,
+ .fflags = RFTYPE_BASE | RFTYPE_MON | RFTYPE_DEBUG,
},
{
.name = "schemata",
@@ -2078,7 +2078,7 @@ static struct rftype res_common_files[] = {
.kf_ops = &rdtgroup_kf_single_ops,
.write = rdtgroup_schemata_write,
.seq_show = rdtgroup_schemata_show,
- .fflags = RFTYPE_CTRL_BASE,
+ .fflags = RFTYPE_BASE | RFTYPE_CTRL,
},
{
.name = "mba_MBps_event",
@@ -2093,14 +2093,14 @@ static struct rftype res_common_files[] = {
.kf_ops = &rdtgroup_kf_single_ops,
.write = rdtgroup_mode_write,
.seq_show = rdtgroup_mode_show,
- .fflags = RFTYPE_CTRL_BASE,
+ .fflags = RFTYPE_BASE | RFTYPE_CTRL,
},
{
.name = "size",
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_size_show,
- .fflags = RFTYPE_CTRL_BASE,
+ .fflags = RFTYPE_BASE | RFTYPE_CTRL,
},
{
.name = "sparse_masks",
@@ -2114,7 +2114,7 @@ static struct rftype res_common_files[] = {
.mode = 0444,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = rdtgroup_closid_show,
- .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG,
+ .fflags = RFTYPE_BASE | RFTYPE_CTRL | RFTYPE_DEBUG,
},
};
@@ -3788,11 +3788,15 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
}
if (rtype == RDTCTRL_GROUP) {
- files = RFTYPE_BASE | RFTYPE_CTRL;
+ files = RFTYPE_CTRL_BASE;
+ if (resctrl_arch_mon_capable())
+ files |= RFTYPE_MON_BASE;
+ } else if (rtype == RDTKERNEL_GROUP) {
+ files = RFTYPE_KERNEL_BASE;
if (resctrl_arch_mon_capable())
files |= RFTYPE_MON;
} else {
- files = RFTYPE_BASE | RFTYPE_MON;
+ files = RFTYPE_MON_BASE;
}
ret = rdtgroup_add_files(kn, files);
@@ -3866,12 +3870,21 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
const char *name, umode_t mode)
{
+ enum rdt_group_type rtype = RDTCTRL_GROUP;
struct rdtgroup *rdtgrp;
struct kernfs_node *kn;
u32 closid;
int ret;
- ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp);
+ if (!strcmp(name, "kernel_group")) {
+ if (!resctrl_arch_kernel_group_is_supported()) {
+ rdt_last_cmd_puts("No support for kernel group\n");
+ return -EINVAL;
+ }
+ rtype = RDTKERNEL_GROUP;
+ }
+
+ ret = mkdir_rdt_prepare(parent_kn, name, mode, rtype, &rdtgrp);
if (ret)
return ret;
@@ -3898,7 +3911,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
- if (resctrl_arch_mon_capable()) {
+ if (rtype == RDTCTRL_GROUP && resctrl_arch_mon_capable()) {
/*
* Create an empty mon_groups directory to hold the subset
* of tasks and cpus to monitor.
@@ -3912,6 +3925,9 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
rdtgrp->mba_mbps_event = mba_mbps_default_event;
}
+ if (rtype == RDTKERNEL_GROUP)
+ resctrl_arch_kernel_group_enable(rdtgrp->closid, rdtgrp->mon.rmid);
+
goto out_unlock;
out_del_list:
@@ -4005,6 +4021,11 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
u32 closid, rmid;
int cpu;
+ if (rdtgrp->type == RDTKERNEL_GROUP) {
+ resctrl_arch_kernel_group_disable();
+ goto skip_tasks_and_cpus;
+ }
+
/* Give any tasks back to the default group */
rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
@@ -4025,6 +4046,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
update_closid_rmid(tmpmask, NULL);
+skip_tasks_and_cpus:
rdtgroup_unassign_cntrs(rdtgrp);
free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
@@ -4073,7 +4095,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
* If the rdtgroup is a mon group and parent directory
* is a valid "mon_groups" directory, remove the mon group.
*/
- if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn &&
+ if ((rdtgrp->type == RDTCTRL_GROUP || rdtgrp->type == RDTKERNEL_GROUP) &&
+ parent_kn == rdtgroup_default.kn &&
rdtgrp != &rdtgroup_default) {
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
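As a usage sketch of the interface the patch above implements (the reserved
directory name comes from the patch; the schemata line is illustrative and
depends on which resources the system exposes):

```
# mount -t resctrl resctrl /sys/fs/resctrl
# mkdir /sys/fs/resctrl/kernel_group        (reserved name: allocates a
                                             CLOSID/RMID pair and calls
                                             resctrl_arch_kernel_group_enable())
# echo "MB:0=2048" > /sys/fs/resctrl/kernel_group/schemata
# rmdir /sys/fs/resctrl/kernel_group        (resctrl_arch_kernel_group_disable())
```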
^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 21:44 ` Luck, Tony
@ 2026-02-17 22:37 ` Reinette Chatre
2026-02-17 22:52 ` Luck, Tony
2026-02-19 11:06 ` Ben Horgan
1 sibling, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-17 22:37 UTC (permalink / raw)
To: Luck, Tony
Cc: Ben Horgan, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Tony,
On 2/17/26 1:44 PM, Luck, Tony wrote:
>>>>> I'm not sure if this would happen in the real world or not.
>>>>
>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>>>
>>> Indeed. This is all getting a bit complicated.
>>>
>>
>> ack
>
> We have several proposals so far:
>
> 1) Ben's suggestion to use the default group (either with a Babu-style
> "plza" file just in that group, or a configuration file under "info/").
>
> This is easily the simplest for implementation, but has no flexibility.
> Also requires users to move all the non-critical workloads out to other
> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
>
> 2) My thoughts are for a separate group that is only used to configure
> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
> are used for all tasks when in kernel mode.
>
> No context switch overhead. Has some flexibility.
>
> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
> group in addition to belonging to another group that defines schemata
> resources when running in non-kernel mode.
> Tasks aren't required to be in the kernel group, in which case they
> keep the same CLOSID in both user and kernel mode. When used in this
> way there will be context switch overhead when changing between tasks
> with different kernel CLOSID/RMID.
>
> 4) Even more complex scenarios with more than one user configurable
> kernel group to give more options on resources available in the kernel.
>
>
> I had a quick pass at coding my option "2". My UI to designate the
> group to use for kernel mode is to reserve the name "kernel_group"
> when making CTRL_MON groups. Some tweaks to avoid creating the
> "tasks", "cpus", and "cpus_list" files (which might be done more
> elegantly), and "mon_groups" directory in this group.
Should the decision of whether context switch overhead is acceptable
not be left up to the user?
I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
the needed registers will only be updated if there is a new CLOSID/RMID needed
for kernel space. Are you suggesting that just this checking itself is too
expensive to justify giving user space more flexibility by fully enabling what
the hardware supports? If resctrl does draw such a line to not enable what
hardware supports it should be well justified.
Reinette
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 22:37 ` Reinette Chatre
@ 2026-02-17 22:52 ` Luck, Tony
2026-02-17 23:55 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-02-17 22:52 UTC (permalink / raw)
To: Reinette Chatre
Cc: Ben Horgan, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
On Tue, Feb 17, 2026 at 02:37:49PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 2/17/26 1:44 PM, Luck, Tony wrote:
> >>>>> I'm not sure if this would happen in the real world or not.
> >>>>
> >>>> Ack. I would like to echo Tony's request for feedback from resctrl users
> >>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
> >>>
> >>> Indeed. This is all getting a bit complicated.
> >>>
> >>
> >> ack
> >
> > We have several proposals so far:
> >
> > 1) Ben's suggestion to use the default group (either with a Babu-style
> > "plza" file just in that group, or a configuration file under "info/").
> >
> > This is easily the simplest for implementation, but has no flexibility.
> > Also requires users to move all the non-critical workloads out to other
> > CTRL_MON groups. Doesn't steal a CLOSID/RMID.
> >
> > 2) My thoughts are for a separate group that is only used to configure
> > the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
> > are used for all tasks when in kernel mode.
> >
> > No context switch overhead. Has some flexibility.
> >
> > 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
> > that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
> > group in addition to belonging to another group that defines schemata
> > resources when running in non-kernel mode.
> > Tasks aren't required to be in the kernel group, in which case they
> > keep the same CLOSID in both user and kernel mode. When used in this
> > way there will be context switch overhead when changing between tasks
> > with different kernel CLOSID/RMID.
> >
> > 4) Even more complex scenarios with more than one user configurable
> > kernel group to give more options on resources available in the kernel.
> >
> >
> > I had a quick pass at coding my option "2". My UI to designate the
> > group to use for kernel mode is to reserve the name "kernel_group"
> > when making CTRL_MON groups. Some tweaks to avoid creating the
> > "tasks", "cpus", and "cpus_list" files (which might be done more
> > elegantly), and "mon_groups" directory in this group.
>
> Should the decision of whether context switch overhead is acceptable
> not be left up to the user?
When someone comes up with a convincing use case to support one set of
kernel resources when interrupting task A, and a different set of
resources when interrupting task B, we should certainly listen.
> I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
> the needed registers will only be updated if there is a new CLOSID/RMID needed
> for kernel space.
Babu's RFC does this.
> Are you suggesting that just this checking itself is too
> expensive to justify giving user space more flexibility by fully enabling what
> the hardware supports? If resctrl does draw such a line to not enable what
> hardware supports it should be well justified.
The check is likely lightweight (as long as the variables to be
compared reside in the same cache lines as the existing CLOSID
and RMID checks). So if there is a use case for different resources
when in kernel mode, then taking this path will be fine.
>
> Reinette
-Tony
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 22:52 ` Luck, Tony
@ 2026-02-17 23:55 ` Reinette Chatre
2026-02-18 16:44 ` Luck, Tony
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-17 23:55 UTC (permalink / raw)
To: Luck, Tony
Cc: Ben Horgan, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Tony,
On 2/17/26 2:52 PM, Luck, Tony wrote:
> On Tue, Feb 17, 2026 at 02:37:49PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 2/17/26 1:44 PM, Luck, Tony wrote:
>>>>>>> I'm not sure if this would happen in the real world or not.
>>>>>>
>>>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
>>>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>>>>>
>>>>> Indeed. This is all getting a bit complicated.
>>>>>
>>>>
>>>> ack
>>>
>>> We have several proposals so far:
>>>
>>> 1) Ben's suggestion to use the default group (either with a Babu-style
>>> "plza" file just in that group, or a configuration file under "info/").
>>>
>>> This is easily the simplest for implementation, but has no flexibility.
>>> Also requires users to move all the non-critical workloads out to other
>>> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
>>>
>>> 2) My thoughts are for a separate group that is only used to configure
>>> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
>>> are used for all tasks when in kernel mode.
>>>
>>> No context switch overhead. Has some flexibility.
>>>
>>> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
>>> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
> >>> group in addition to belonging to another group that defines schemata
>>> resources when running in non-kernel mode.
>>> Tasks aren't required to be in the kernel group, in which case they
>>> keep the same CLOSID in both user and kernel mode. When used in this
>>> way there will be context switch overhead when changing between tasks
>>> with different kernel CLOSID/RMID.
>>>
>>> 4) Even more complex scenarios with more than one user configurable
>>> kernel group to give more options on resources available in the kernel.
>>>
>>>
> >>> I had a quick pass at coding my option "2". My UI to designate the
>>> group to use for kernel mode is to reserve the name "kernel_group"
>>> when making CTRL_MON groups. Some tweaks to avoid creating the
>>> "tasks", "cpus", and "cpus_list" files (which might be done more
>>> elegantly), and "mon_groups" directory in this group.
>>
>> Should the decision of whether context switch overhead is acceptable
>> not be left up to the user?
>
> When someone comes up with a convincing use case to support one set of
> kernel resources when interrupting task A, and a different set of
> resources when interrupting task B, we should certainly listen.
Absolutely. Someone can come up with such a use case at any time though. This
could be, and as has happened with some other resctrl interfaces, likely will be
after this feature has been supported for a few kernel versions. What timeline
should we give which users to share their use cases with us? Even if we do hear
from some users, will that guarantee that no such use case will arise in the
future? Such predictions of usage are difficult for me and I thus find it simpler
to think of flexible ways to enable the features that we know the hardware supports.
This does not mean that a full featured solution needs to be implemented from day 1.
If folks believe there are "no valid use cases" today resctrl still needs to prepare for
how it can grow to support full hardware capability and hardware designs in the
future.
Also, please consider not just resources for kernel work but also monitoring for
kernel work. I do think, for example, a reasonable use case may be to determine
how much memory bandwidth the kernel uses on behalf of certain tasks.
>> I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
>> the needed registers will only be updated if there is a new CLOSID/RMID needed
>> for kernel space.
>
> Babu's RFC does this.
Right.
>
>> Are you suggesting that just this checking itself is too
>> expensive to justify giving user space more flexibility by fully enabling what
>> the hardware supports? If resctrl does draw such a line to not enable what
>> hardware supports it should be well justified.
>
> The check is likely lightweight (as long as the variables to be
> compared reside in the same cache lines as the existing CLOSID
> and RMID checks). So if there is a use case for different resources
> when in kernel mode, then taking this path will be fine.
Why limit this to knowing about a use case? As I understand this feature can be
supported in a flexible way without introducing additional context switch overhead
if the user prefers to use just one allocation for all kernel work. By being
configurable and allowing resctrl to support more use cases in the future resctrl
does not paint itself into a corner. This allows resctrl to grow support so that
the user can use all capabilities of the hardware with understanding that it will
increase context switch time.
Reinette
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 15:56 ` Ben Horgan
2026-02-17 16:38 ` Babu Moger
@ 2026-02-18 6:22 ` Stephane Eranian
2026-02-18 9:35 ` Ben Horgan
1 sibling, 1 reply; 114+ messages in thread
From: Stephane Eranian @ 2026-02-18 6:22 UTC (permalink / raw)
To: Ben Horgan
Cc: Moger, Babu, Reinette Chatre, Moger, Babu, Luck, Tony,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
On Tue, Feb 17, 2026 at 7:56 AM Ben Horgan <ben.horgan@arm.com> wrote:
>
> Hi Babu,
>
> On 2/16/26 22:52, Moger, Babu wrote:
> > Hi Ben,
> >
> > On 2/16/2026 9:41 AM, Ben Horgan wrote:
> >> Hi Babu, Reinette,
> >>
> >> On 2/14/26 00:10, Reinette Chatre wrote:
> >>> Hi Babu,
> >>>
> >>> On 2/13/26 8:37 AM, Moger, Babu wrote:
> >>>> Hi Reinette,
> >>>>
> >>>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
> >>>>> Hi Babu,
> >>>>>
> >>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
> >>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> >>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
> >>>>>>>> Babu,
> >>>>>>>>
> >>>>>>>> I've read a bit more of the code now and I think I understand more.
> >>>>>>>>
> >>>>>>>> Some useful additions to your explanation.
> >>>>>>>>
> >>>>>>>> 1) Only one CTRL group can be marked as PLZA
> >>>>>>>
> >>>>>>> Yes. Correct.
> >>>>>
> >>>>> Why limit it to one CTRL_MON group and why not support it for MON
> >>>>> groups?
> >>>>
> >>>> There can be only one PLZA configuration in a system. The values in
> >>>> the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID,
> >>>> CLOSID_EN) must be identical across all logical processors. The only
> >>>> field that may differ is PLZA_EN.
> >>
> >> Does this have any effect on hypervisors?
> >
> > Because hypervisor runs at CPL0, there could be some use case. I have
> > not completely understood that part.
> >
> >>
> >>>
> >>> ah - this is a significant part that I missed. Since this is a per-
> >>> CPU register it seems
> >>
> >> I also missed that.
> >>
> >>> to have the ability for expanded use in the future where different
> >>> CLOSID and RMID may be
> >>> written to it? Is PLZA leaving room for such future enhancement or
> >>> does the spec contain
> >>> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC
> >>> register (RMID, RMID_EN,
> >>> CLOSID, CLOSID_EN) must be identical across all logical processors."?
> >>> That is, "forever
> >>> and always"?
> >>>
> >>> If I understand correctly MPAM could have different PARTID and PMG
> >>> for kernel use so we
> >>> need to consider these different architectural behaviors.
> >>
> >> Yes, MPAM has a per-cpu register MPAM1_EL1.
> >>
> >
> > oh ok.
> >
> >>>
> >>>> I was initially unsure which RMID should be used when PLZA is
> >>>> enabled on MON groups.
> >>>>
> >>>> After re-evaluating, enabling PLZA on MON groups is still feasible:
> >>>>
> >>>> 1. Only one group in the system can have PLZA enabled.
> >>>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA
> >>>> on MON group.
> >>>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and
> >>>> RMID of the CTRL_MON group can be written.
> >>>> 4. If PLZA is enabled on a MON group, then the CLOSID of the
> >>>> CTRL_MON group can be used, while the RMID of the MON group can be
> >>>> written.
> >>
> >> Given that CLOSID and RMID are fixed once in the PLZA configuration
> >> could this be simplified by just assuming they have the values of the
> >> default group, CLOSID=0 and RMID=0 and let the user base their
> >> configuration on that?
> >>
> >
> > I didn't understand this question. There are 16 CLOSIDs and 1024 RMIDs.
> > We can use any one of these to enable PLZA. It is not fixed in that sense.
>
> Sorry, I wasn't clear. What I'm trying to understand is what you gain by
> this flexibility. Given that the CLOSID and RMID values are just
> identifiers within the hardware and have only the meaning they are given
> by the grouping and controls/monitors set up by resctrl (or any other
> software interface), would you lose anything by just saying the PLZA
> group has CLOSID=0 and RMID=0? Is there value in changing the PLZA
> CLOSID and RMID or can the same effect happen by just changing the
> resctrl configuration?
>
Not quite.
When you enter the kernel, you want to run unthrottled to avoid
priority inversion situations.
But at the same time, you still want to be able to monitor the
bandwidth for your thread or job, i.e., keep the same
RMID you have in user space.
The kernel is by construction shared by all threads running in the
system. It should run unrestricted or with the
bandwidth allocated to the highest priority tasks.
PLZA should not change the RMID at all.
You could obtain the same effect by changing the quota for each CLOSID
entering the kernel. But that would likely be more expensive
and you would have to do this for every possible entry and exit point
(restore on exit).
> I was also wondering if using the default group this way would mean that
> you wouldn't need to reserve the group for only kernel use.
>
> >
> >
> >>>>
> >>>> I am thinking this approach should work.
> >>>>
> >>>>>
> >>>>> Limiting it to a single CTRL group seems restrictive in a few ways:
> >>>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This
> >>>>> reduces the
> >>>>> number of use cases that can be supported. Consider, for
> >>>>> example, an existing
> >>>>> "high priority" resource group and a "low priority" resource
> >>>>> group. The user may
> >>>>> just want to let the tasks in the "low priority" resource
> >>>>> group run as "high priority"
> >>>>> when in CPL0. This of course may depend on what resources are
> >>>>> allocated, for example
> >>>>> cache may need more care, but if, for example, user is only
> >>>>> interested in memory
> >>>>> bandwidth allocation this seems a reasonable use case?
> >>>>> 2) Similar to what Tony [1] mentioned this does not enable what the
> >>>>> hardware is
> >>>>> capable of in terms of number of different control groups/
> >>>>> CLOSID that can be
> >>>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one
> >>>>> CLOSID?
> >>>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC
> >>>>> similar to
> >>>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user
> >>>>> space to, for
> >>>>> example, create a resource group that contains tasks of
> >>>>> interest and create
> >>>>> a monitor group within it that monitors all tasks' bandwidth
> >>>>> usage when in CPL0.
> >>>>> This will give user space better insight into system behavior
> >>>>> and from what I can
> >>>>> tell is supported by the feature but not enabled?
> >>>>
> >>>>
> >>>> Yes, as long as PLZA is enabled on only one group in the entire system
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>>> 2) It can't be the root/default group
> >>>>>>>
> >>>>>>> This is something I added to keep the default group in a un-
> >>>>>>> disturbed,
> >>>>>
> >>>>> Why was this needed?
> >>>>>
> >>>>
> >>>> With the new approach mentioned above, we can enable it in the default group
> >>>> also.
> >>>>
> >>>>>>>
> >>>>>>>> 3) It can't have sub monitor groups
> >>>>>
> >>>>> Why not?
> >>>>
> >>>> Ditto. With the new approach mentioned above, we can enable it in the
> >>>> default group also.
> >>>>
> >>>>>
> >>>>>>>> 4) It can't be pseudo-locked
> >>>>>>>
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Would a potential use case involve putting *all* tasks into the
> >>>>>>>> PLZA group? That
> >>>>>>>> would avoid any additional context switch overhead as the PLZA
> >>>>>>>> MSR would never
> >>>>>>>> need to change.
> >>>>>>>
> >>>>>>> Yes. That can be one use case.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> If that is the case, maybe for the PLZA group we should allow
> >>>>>>>> user to
> >>>>>>>> do:
> >>>>>>>>
> >>>>>>>> # echo '*' > tasks
> >>>>>
> >>>>> Dedicating a resource group to "PLZA" seems restrictive while also
> >>>>> adding many
> >>>>> complications since this designation makes resource group behave
> >>>>> differently and
> >>>>> thus the files need to get extra "treatments" to handle this "PLZA"
> >>>>> designation.
> >>>>>
> >>>>> I am wondering if it will not be simpler to introduce just one new
> >>>>> file, for example
> >>>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space
> >>>>> writes a task ID to the
> >>>>> file it "enables" PLZA for this task and that group's CLOSID and
> >>>>> RMID are the associated
> >>>>> task's "PLZA" CLOSID and RMID. This gives user space the
> >>>>> flexibility to use the same
> >>>>> resource group to manage user space and kernel space allocations
> >>>>> while also supporting
> >>>>> various monitoring use cases. This still supports the "dedicate a
> >>>>> resource group to PLZA"
> >>>>> use case where user space can create a new resource group with
> >>>>> certain allocations but the
> >>>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks
> >>>>> needing to run with
> >>>>> the resource group's allocations when in CPL0.
> >>>>
> >>>> Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
> >>>>
> >>>> We need to make sure only one group can be configured in the system and
> >>>> not allow in other groups when it is already enabled.
> >>>
> >>> As I understand this means that only one group can have content in its
> >>> tasks_cpl0/tasks_kernel file. There should not be any special
> >>> handling for
> >>> the remaining files of the resource group since the resource group is
> >>> not
> >>> dedicated to kernel work and can be used as a user space resource
> >>> group also.
> >>> If user space wants to create a dedicated kernel resource group there
> >>> can be
> >>> a new resource group with an empty tasks file.
> >>>
> >>> hmmm ... but if user space writes a task ID to a tasks_cpl0/
> >>> tasks_kernel file then
> >>> resctrl would need to create new syntax to remove that task ID.
> >>>
> >>> Possibly MPAM can build on this by allowing user space to write to
> >>> multiple
> >>> tasks_cpl0/tasks_kernel files? (and the next version of PLZA may too)
> >>>
> >>> Reinette
> >>>
> >>>
> >>>>
> >>>> Thanks
> >>>> Babu
> >>>>
> >>>>>
> >>>>> Reinette
> >>>>>
> >>>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >> Thanks,
> >>
> >> Ben
> >>
> >>
> >
>
> Thanks,
>
> Ben
>
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-18 6:22 ` Stephane Eranian
@ 2026-02-18 9:35 ` Ben Horgan
2026-02-19 10:27 ` Ben Horgan
0 siblings, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-18 9:35 UTC (permalink / raw)
To: Stephane Eranian
Cc: Moger, Babu, Reinette Chatre, Moger, Babu, Luck, Tony,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Stephane,
On 2/18/26 06:22, Stephane Eranian wrote:
> On Tue, Feb 17, 2026 at 7:56 AM Ben Horgan <ben.horgan@arm.com> wrote:
>>
>> Hi Babu,
>>
>> On 2/16/26 22:52, Moger, Babu wrote:
>>> Hi Ben,
>>>
>>> On 2/16/2026 9:41 AM, Ben Horgan wrote:
>>>> Hi Babu, Reinette,
>>>>
>>>> On 2/14/26 00:10, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 2/13/26 8:37 AM, Moger, Babu wrote:
>>>>>> Hi Reinette,
>>>>>>
>>>>>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>>>>> Babu,
>>>>>>>>>>
>>>>>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>>>>>
>>>>>>>>>> Some useful additions to your explanation.
>>>>>>>>>>
>>>>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>>>>
>>>>>>>>> Yes. Correct.
>>>>>>>
>>>>>>> Why limit it to one CTRL_MON group and why not support it for MON
>>>>>>> groups?
>>>>>>
>>>>>> There can be only one PLZA configuration in a system. The values in
>>>>>> the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID,
>>>>>> CLOSID_EN) must be identical across all logical processors. The only
>>>>>> field that may differ is PLZA_EN.
>>>>
>>>> Does this have any effect on hypervisors?
>>>
>>> Because hypervisor runs at CPL0, there could be some use case. I have
>>> not completely understood that part.
>>>
>>>>
>>>>>
>>>>> ah - this is a significant part that I missed. Since this is a per-
>>>>> CPU register it seems
>>>>
>>>> I also missed that.
>>>>
>>>>> to have the ability for expanded use in the future where different
>>>>> CLOSID and RMID may be
>>>>> written to it? Is PLZA leaving room for such future enhancement or
>>>>> does the spec contain
>>>>> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC
>>>>> register (RMID, RMID_EN,
>>>>> CLOSID, CLOSID_EN) must be identical across all logical processors."?
>>>>> That is, "forever
>>>>> and always"?
>>>>>
>>>>> If I understand correctly MPAM could have different PARTID and PMG
>>>>> for kernel use so we
>>>>> need to consider these different architectural behaviors.
>>>>
>>>> Yes, MPAM has a per-cpu register MPAM1_EL1.
>>>>
>>>
>>> oh ok.
>>>
>>>>>
>>>>>> I was initially unsure which RMID should be used when PLZA is
>>>>>> enabled on MON groups.
>>>>>>
>>>>>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>>>>>
>>>>>> 1. Only one group in the system can have PLZA enabled.
>>>>>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA
>>>>>> on MON group.
>>>>>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and
>>>>>> RMID of the CTRL_MON group can be written.
>>>>>> 4. If PLZA is enabled on a MON group, then the CLOSID of the
>>>>>> CTRL_MON group can be used, while the RMID of the MON group can be
>>>>>> written.
>>>>
>>>> Given that CLOSID and RMID are fixed once in the PLZA configuration
>>>> could this be simplified by just assuming they have the values of the
>>>> default group, CLOSID=0 and RMID=0, and let the user base their
>>>> configuration on that?
>>>>
>>>
>>> I didn't understand this question. There are 16 CLOSIDs and 1024 RMIDs.
>>> We can use any one of these to enable PLZA. It is not fixed in that sense.
>>
>> Sorry, I wasn't clear. What I'm trying to understand is what you gain by
>> this flexibility. Given that the CLOSID and RMID values are just
>> identifiers within the hardware and have only the meaning they are given
>> by the grouping and controls/monitors set up by resctrl (or any other
>> software interface), would you lose anything by just saying the PLZA
>> group has CLOSID=0 and RMID=0? Is there value in changing the PLZA
>> CLOSID and RMID or can the same effect happen by just changing the
>> resctrl configuration?
>>
> Not quite.
> When you enter the kernel, you want to run unthrottled to avoid
> priority inversion situations.
> But at the same time, you still want to be able to monitor the
> bandwidth for your thread or job, i.e., keep the same
> RMID you have in user space.
Thanks for sharing your usecase.
>
> The kernel is by construction shared by all threads running in the
> system. It should run unrestricted or with the
> bandwidth allocated to the highest priority tasks.
>
> PLZA should not change the RMID at all.
Would the above with RMID_EN=0 give you this usecase?
Unfortunately, this isn't possible when rmid/pmg is scoped to
closid/partid, as is the case in MPAM, i.e. the monitors require a match
on the (partid, pmg) pair. Hence, I think we need to support the case
where both RMID and CLOSID change.
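The scoping constraint can be sketched as a toy model in C (names and types invented for illustration; the real MPAM monitor configuration lives in hardware registers, not a struct): a monitor programmed with a (partid, pmg) pair counts only traffic labelled with exactly that pair, so changing the kernel's partid while keeping the user pmg means the user's monitor no longer matches kernel traffic.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of an MPAM-style monitor (illustrative only): because pmg
 * is scoped to partid, the monitor matches on the (partid, pmg) pair,
 * not on pmg alone.
 */
struct mpam_monitor {
	uint16_t partid;
	uint8_t  pmg;
};

bool monitor_matches(const struct mpam_monitor *mon,
		     uint16_t partid, uint8_t pmg)
{
	return mon->partid == partid && mon->pmg == pmg;
}
```

With this model, a task monitored in user space as (partid=5, pmg=2) stops being counted if only the partid changes on kernel entry, which is why both values may need to change together.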
>
> You could obtain the same effect by changing the quota for each CLOSID
> entering the kernel. But that would likely be more expensive
> and you would have to do this for every possible entry and exit point
> (restore on exit).
>
>
>
>> I was also wondering if using the default group this way would mean that
>> you wouldn't need to reserve the group for only kernel use.
>>
>>>
>>>
>>>>>>
>>>>>> I am thinking this approach should work.
>>>>>>
>>>>>>>
>>>>>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>>>>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This
>>>>>>> reduces the
>>>>>>> number of use cases that can be supported. Consider, for
>>>>>>> example, an existing
>>>>>>> "high priority" resource group and a "low priority" resource
>>>>>>> group. The user may
>>>>>>> just want to let the tasks in the "low priority" resource
>>>>>>> group run as "high priority"
>>>>>>> when in CPL0. This of course may depend on what resources are
>>>>>>> allocated, for example
>>>>>>> cache may need more care, but if, for example, user is only
>>>>>>> interested in memory
>>>>>>> bandwidth allocation this seems a reasonable use case?
>>>>>>> 2) Similar to what Tony [1] mentioned this does not enable what the
>>>>>>> hardware is
>>>>>>> capable of in terms of number of different control groups/
>>>>>>> CLOSID that can be
>>>>>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one
>>>>>>> CLOSID?
>>>>>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC
>>>>>>> similar to
>>>>>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user
>>>>>>> space to, for
>>>>>>> example, create a resource group that contains tasks of
>>>>>>> interest and create
>>>>>>> a monitor group within it that monitors all tasks' bandwidth
>>>>>>> usage when in CPL0.
>>>>>>> This will give user space better insight into system behavior
>>>>>>> and from what I can
>>>>>>> tell is supported by the feature but not enabled?
>>>>>>
>>>>>>
>>>>>> Yes, as long as PLZA is enabled on only one group in the entire system
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2) It can't be the root/default group
>>>>>>>>>
>>>>>>>>> This is something I added to keep the default group in an
>>>>>>>>> undisturbed,
>>>>>>>
>>>>>>> Why was this needed?
>>>>>>>
>>>>>>
>>>>>> With the new approach mentioned above we can enable it in the default
>>>>>> group also.
>>>>>>
>>>>>>>>>
>>>>>>>>>> 3) It can't have sub monitor groups
>>>>>>>
>>>>>>> Why not?
>>>>>>
>>>>>> Ditto. With the new approach mentioned above we can enable it in the
>>>>>> default group also.
>>>>>>
>>>>>>>
>>>>>>>>>> 4) It can't be pseudo-locked
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Would a potential use case involve putting *all* tasks into the
>>>>>>>>>> PLZA group? That
>>>>>>>>>> would avoid any additional context switch overhead as the PLZA
>>>>>>>>>> MSR would never
>>>>>>>>>> need to change.
>>>>>>>>>
>>>>>>>>> Yes. That can be one use case.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If that is the case, maybe for the PLZA group we should allow
>>>>>>>>>> user to
>>>>>>>>>> do:
>>>>>>>>>>
>>>>>>>>>> # echo '*' > tasks
>>>>>>>
>>>>>>> Dedicating a resource group to "PLZA" seems restrictive while also
>>>>>>> adding many
>>>>>>> complications since this designation makes resource group behave
>>>>>>> differently and
>>>>>>> thus the files need to get extra "treatments" to handle this "PLZA"
>>>>>>> designation.
>>>>>>>
>>>>>>> I am wondering if it will not be simpler to introduce just one new
>>>>>>> file, for example
>>>>>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space
>>>>>>> writes a task ID to the
>>>>>>> file it "enables" PLZA for this task and that group's CLOSID and
>>>>>>> RMID is the associated
>>>>>>> task's "PLZA" CLOSID and RMID. This gives user space the
>>>>>>> flexibility to use the same
>>>>>>> resource group to manage user space and kernel space allocations
>>>>>>> while also supporting
>>>>>>> various monitoring use cases. This still supports the "dedicate a
>>>>>>> resource group to PLZA"
>>>>>>> use case where user space can create a new resource group with
>>>>>>> certain allocations but the
>>>>>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks
>>>>>>> needing to run with
>>>>>>> the resource group's allocations when in CPL0.
>>>>>>
>>>>>> Yes. We should be able to do that. We need both tasks_cpl0 and cpus_cpl0.
>>>>>>
>>>>>> We need to make sure only one group can be configured in the system and
>>>>>> not allow it in other groups when it is already enabled.
>>>>>
>>>>> As I understand this means that only one group can have content in its
>>>>> tasks_cpl0/tasks_kernel file. There should not be any special
>>>>> handling for
>>>>> the remaining files of the resource group since the resource group is
>>>>> not
>>>>> dedicated to kernel work and can be used as a user space resource
>>>>> group also.
>>>>> If user space wants to create a dedicated kernel resource group there
>>>>> can be
>>>>> a new resource group with an empty tasks file.
>>>>>
>>>>> hmmm ... but if user space writes a task ID to a tasks_cpl0/
>>>>> tasks_kernel file then
>>>>> resctrl would need to create new syntax to remove that task ID.
>>>>>
>>>>> Possibly MPAM can build on this by allowing user space to write to
>>>>> multiple
>>>>> tasks_cpl0/tasks_kernel files? (and the next version of PLZA may too)
>>>>>
>>>>> Reinette
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Babu
>>>>>>
>>>>>>>
>>>>>>> Reinette
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Ben
>>>>
>>>>
>>>
>>
>> Thanks,
>>
>> Ben
>>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 16:38 ` Babu Moger
@ 2026-02-18 9:54 ` Ben Horgan
0 siblings, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-18 9:54 UTC (permalink / raw)
To: Babu Moger, Moger, Babu, Reinette Chatre, Luck, Tony
Cc: corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Babu,
On 2/17/26 16:38, Babu Moger wrote:
> Hi Ben,
>
> On 2/17/26 09:56, Ben Horgan wrote:
>> Hi Babu,
>>
>> On 2/16/26 22:52, Moger, Babu wrote:
>>> Hi Ben,
>>>
>>> On 2/16/2026 9:41 AM, Ben Horgan wrote:
>>>> Hi Babu, Reinette,
>>>>
>>>> On 2/14/26 00:10, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 2/13/26 8:37 AM, Moger, Babu wrote:
>>>>>> Hi Reinette,
>>>>>>
>>>>>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>>>>
>>>>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>>>>> Babu,
>>>>>>>>>>
>>>>>>>>>> I've read a bit more of the code now and I think I understand
>>>>>>>>>> more.
>>>>>>>>>>
>>>>>>>>>> Some useful additions to your explanation.
>>>>>>>>>>
>>>>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>>>> Yes. Correct.
>>>>>>> Why limit it to one CTRL_MON group and why not support it for MON
>>>>>>> groups?
>>>>>> There can be only one PLZA configuration in a system. The values in
>>>>>> the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID,
>>>>>> CLOSID_EN) must be identical across all logical processors. The only
>>>>>> field that may differ is PLZA_EN.
>>>> Does this have any effect on hypervisors?
>>> Because hypervisor runs at CPL0, there could be some use case. I have
>>> not completely understood that part.
>>>
>>>>> ah - this is a significant part that I missed. Since this is a per-
>>>>> CPU register it seems
>>>> I also missed that.
>>>>
>>>>> to have the ability for expanded use in the future where different
>>>>> CLOSID and RMID may be
>>>>> written to it? Is PLZA leaving room for such future enhancement or
>>>>> does the spec contain
>>>>> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC
>>>>> register (RMID, RMID_EN,
>>>>> CLOSID, CLOSID_EN) must be identical across all logical processors."?
>>>>> That is, "forever
>>>>> and always"?
>>>>>
>>>>> If I understand correctly MPAM could have different PARTID and PMG
>>>>> for kernel use so we
>>>>> need to consider these different architectural behaviors.
>>>> Yes, MPAM has a per-cpu register MPAM1_EL1.
>>>>
>>> oh ok.
>>>
>>>>>> I was initially unsure which RMID should be used when PLZA is
>>>>>> enabled on MON groups.
>>>>>>
>>>>>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>>>>>
>>>>>> 1. Only one group in the system can have PLZA enabled.
>>>>>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA
>>>>>> on MON group.
>>>>>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and
>>>>>> RMID of the CTRL_MON group can be written.
>>>>>> 4. If PLZA is enabled on a MON group, then the CLOSID of the
>>>>>> CTRL_MON group can be used, while the RMID of the MON group can be
>>>>>> written.
>>>> Given that CLOSID and RMID are fixed once in the PLZA configuration
>>>> could this be simplified by just assuming they have the values of the
>>>> default group, CLOSID=0 and RMID=0, and let the user base their
>>>> configuration on that?
>>>>
>>> I didn't understand this question. There are 16 CLOSIDs and 1024 RMIDs.
>>> We can use any one of these to enable PLZA. It is not fixed in that
>>> sense.
>> Sorry, I wasn't clear. What I'm trying to understand is what you gain by
>> this flexibility. Given that the CLOSID and RMID values are just
>> identifiers within the hardware and have only the meaning they are given
>> by the grouping and controls/monitors set up by resctrl (or any other
>> software interface), would you lose anything by just saying the PLZA
>> group has CLOSID=0 and RMID=0? Is there value in changing the PLZA
>> CLOSID and RMID or can the same effect happen by just changing the
>> resctrl configuration?
>>
>> I was also wondering if using the default group this way would mean that
>> you wouldn't need to reserve the group for only kernel use.
>
> Yes, that is an option, but it becomes too restrictive. Would this
> approach work for the ARM implementation?
As a minimum, a fixed partid/pmg would be ok for mpam but we would want
to be able to grow features from this. (A changeable partid/pmg can also
work.) In either case, we wouldn't want the partid to be necessarily
exclusive to the kernel, as some configurations assign one partid to each
cpu and the number of partids is often equal to the number of cpus.
>
> If a user wants to keep a selective set of tasks running at different
> allocation levels, they would need to create another new group, move all
> tasks from default group to new group, and leave only the selected
> tasks in the default group.
>
> And if that group is later deleted, all tasks will automatically return
> to the default group.
>
> Thanks,
> Babu
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 23:55 ` Reinette Chatre
@ 2026-02-18 16:44 ` Luck, Tony
2026-02-19 17:03 ` Luck, Tony
` (2 more replies)
0 siblings, 3 replies; 114+ messages in thread
From: Luck, Tony @ 2026-02-18 16:44 UTC (permalink / raw)
To: Reinette Chatre
Cc: Ben Horgan, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
On Tue, Feb 17, 2026 at 03:55:44PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 2/17/26 2:52 PM, Luck, Tony wrote:
> > On Tue, Feb 17, 2026 at 02:37:49PM -0800, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 2/17/26 1:44 PM, Luck, Tony wrote:
> >>>>>>> I'm not sure if this would happen in the real world or not.
> >>>>>>
> >>>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
> >>>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
> >>>>>
> >>>>> Indeed. This is all getting a bit complicated.
> >>>>>
> >>>>
> >>>> ack
> >>>
> >>> We have several proposals so far:
> >>>
> >>> 1) Ben's suggestion to use the default group (either with a Babu-style
> >>> "plza" file just in that group, or a configuration file under "info/").
> >>>
> >>> This is easily the simplest for implementation, but has no flexibility.
> >>> Also requires users to move all the non-critical workloads out to other
> >>> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
> >>>
> >>> 2) My thoughts are for a separate group that is only used to configure
> >>> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
> >>> are used for all tasks when in kernel mode.
> >>>
> >>> No context switch overhead. Has some flexibility.
> >>>
> >>> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
> >>> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
> >>> group in addition to belonging to another group that defines schemata
> >>> resources when running in non-kernel mode.
> >>> Tasks aren't required to be in the kernel group, in which case they
> >>> keep the same CLOSID in both user and kernel mode. When used in this
> >>> way there will be context switch overhead when changing between tasks
> >>> with different kernel CLOSID/RMID.
> >>>
> >>> 4) Even more complex scenarios with more than one user configurable
> >>> kernel group to give more options on resources available in the kernel.
> >>>
> >>>
> >>> I had a quick pass as coding my option "2". My UI to designate the
> >>> group to use for kernel mode is to reserve the name "kernel_group"
> >>> when making CTRL_MON groups. Some tweaks to avoid creating the
> >>> "tasks", "cpus", and "cpus_list" files (which might be done more
> >>> elegantly), and "mon_groups" directory in this group.
> >>
> >> Should the decision of whether context switch overhead is acceptable
> >> not be left up to the user?
> >
> > When someone comes up with a convincing use case to support one set of
> > kernel resources when interrupting task A, and a different set of
> > resources when interrupting task B, we should certainly listen.
>
> Absolutely. Someone can come up with such a use case at any time though. This
> could be, and as has happened with some other resctrl interfaces, likely will be
> after this feature has been supported for a few kernel versions. What timeline
> should we give which users to share their use cases with us? Even if we do hear
> from some users will that guarantee that no such use case will arise in the
> future? Such predictions of usage are difficult for me and I thus find it simpler
> to think of flexible ways to enable the features that we know the hardware supports.
>
> This does not mean that a full featured solution needs to be implemented from day 1.
> If folks believe there are "no valid use cases" today resctrl still needs to prepare for
> how it can grow to support full hardware capability and hardware designs in the
> future.
>
> Also, please also consider not just resources for kernel work but also monitoring for
> kernel work. I do think, for example, a reasonable use case may be to determine
> how much memory bandwidth the kernel uses on behalf of certain tasks.
>
> >> I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
> >> the needed registers will only be updated if there is a new CLOSID/RMID needed
> >> for kernel space.
> >
> > Babu's RFC does this.
>
> Right.
>
> >
> >> Are you suggesting that just this checking itself is too
> >> expensive to justify giving user space more flexibility by fully enabling what
> >> the hardware supports? If resctrl does draw such a line to not enable what
> >> hardware supports it should be well justified.
> >
> > The check is likely lightweight (as long as the variables to be
> > compared reside in the same cache lines as the existing CLOSID
> > and RMID checks). So if there is a use case for different resources
> > when in kernel mode, then taking this path will be fine.
>
> Why limit this to knowing about a use case? As I understand this feature can be
> supported in a flexible way without introducing additional context switch overhead
> if the user prefers to use just one allocation for all kernel work. By being
> configurable and allowing resctrl to support more use cases in the future resctrl
> does not paint itself into a corner. This allows resctrl to grow support so that
> the user can use all capabilities of the hardware with understanding that it will
> increase context switch time.
>
> Reinette
How about this idea for extensibility.
Rename Babu's "plza" file to "plza_mode". Instead of just being an
on/off switch, it may accept multiple possible requests.
Humorous version:
# echo "babu" > plza_mode
This results in behavior of Babu's RFC. The CLOSID and RMID assigned to
the CTRL_MON group are used when in kernel mode, but only for tasks that
have their task-id written to the "tasks" file, or for tasks running on
CPUs assigned to this group via the "cpus" or "cpus_list" files.
# echo "tony" > plza_mode
All tasks run with the CLOSID/RMID for this group. The "tasks", "cpus" and
"cpus_list" files and the "mon_groups" directory are removed.
# echo "ben" > plza_mode
Only usable in the top-level default CTRL_MON directory. CLOSID=0/RMID=0
are used for all tasks in kernel mode.
# echo "stephane" > plza_mode
The RMID for this group is freed. All tasks run in kernel mode with the
CLOSID for this group, but use the same RMID for both user and kernel.
In addition to files removed in "tony" mode, the mon_data directory is
removed.
# echo "some-future-name" > plza_mode
Somebody has a new use case. Resctrl can be extended by allowing some
new mode.
Likely real implementation:
Sub-components of each of the ideas above are encoded as a bitmask that
is written to plza_mode. There is a file in the info/ directory listing
which bits are supported on the current system (e.g. the "keep the same
RMID" mode may be impractical on ARM, so it would not be listed as an
option.)
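One way the "sub-components encoded as a bitmask" idea could be sketched in C. Everything below is invented for illustration (bit names, function names, the mapping of bits to the modes discussed above); none of it is a published resctrl ABI:

```c
#include <stdint.h>

/* Hypothetical mode bits for a "plza_mode" style bitmask (names invented). */
#define PLZA_MODE_PER_GROUP   (1u << 0) /* kernel CLOSID/RMID taken from a designated group */
#define PLZA_MODE_GLOBAL      (1u << 1) /* one schemata-only kernel group, no tasks/cpus files */
#define PLZA_MODE_DEFAULT_GRP (1u << 2) /* CLOSID=0/RMID=0 used for all kernel work */
#define PLZA_MODE_KEEP_RMID   (1u << 3) /* keep the user RMID while running in the kernel */

/*
 * What a file under info/ would advertise: the subset of bits this
 * system supports. For example, "keep the same RMID" may be absent on
 * MPAM systems where pmg is scoped to partid.
 */
uint32_t plza_supported_modes(int pmg_scoped_to_partid)
{
	uint32_t m = PLZA_MODE_PER_GROUP | PLZA_MODE_GLOBAL |
		     PLZA_MODE_DEFAULT_GRP;

	if (!pmg_scoped_to_partid)
		m |= PLZA_MODE_KEEP_RMID;
	return m;
}

/* A write to plza_mode is accepted only if every requested bit is supported. */
int plza_mode_valid(uint32_t requested, uint32_t supported)
{
	return (requested & ~supported) == 0;
}
```

The point of the validity check is that user space can discover and request capabilities per system, and new bits can be added later without changing the file format.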
-Tony
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 18:51 ` Reinette Chatre
2026-02-17 21:44 ` Luck, Tony
@ 2026-02-19 10:21 ` Ben Horgan
2026-02-19 18:14 ` Reinette Chatre
1 sibling, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-19 10:21 UTC (permalink / raw)
To: Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Reinette,
On 2/17/26 18:51, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/16/26 7:18 AM, Ben Horgan wrote:
>> On Thu, Feb 12, 2026 at 10:37:21AM -0800, Reinette Chatre wrote:
>>> On 2/12/26 5:55 AM, Ben Horgan wrote:
>>>> On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
>>>>> On 2/11/26 8:40 AM, Ben Horgan wrote:
>>>>>> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>
>>>>>>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>>>>>>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>>>>>>> instead of CPL0 using something like "kernel" or ... ?
>>>>>>
>>>>>> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
>>>>>> internally and here are a few thoughts.
>>>>>>
>>>>>> If the use case is just an option to run all tasks with the same closid/rmid
>>>>>> (partid/pmg) configuration when they are running in the kernel then I'd favour a
>>>>>> mount option. The resctrl filesystem interface doesn't need to change and
>>>>>
>>>>> I view mount options as an interface of last resort. Why would a mount option be needed
>>>>> in this case? The existence of the file used to configure the feature seems sufficient?
>>>>
>>>> If we are taking away a closid from the user then the number of CTRL_MON groups
>>>> that can be created changes. It seems reasonable for user-space to expect
>>>> num_closid to be a fixed value.
>>>
>>> I do not see why we need to take away a CLOSID from the user. Consider a user space that
>>
>> Yes, it is just slightly simpler to take away a CLOSID, but we could just go
>> with the default CLOSID also being used for the kernel. I would be ok with a
>> file saying the mode, like the mbm_event file does for counter assignment. It
>> is slightly misleading that a configuration file is under info, but necessary
>> as we don't have another location global to the resctrl mount.
>
> Indeed, the "info" directory has evolved more into a "config" directory.
>
>>> runs with just two resource groups, for example, "high priority" and "low priority", it seems
>>> reasonable to make it possible to let the "low priority" tasks run with "high priority"
>>> allocations when in kernel space without needing to dedicate a new CLOSID? More reasonable
>>> when only considering memory bandwidth allocation though.
>>>
>>>>
>>>>>
>>>>> Also ...
>>>>>
>>>>> I do not think resctrl should unnecessarily place constraints on what the hardware
>>>>> features are capable of. As I understand, both PLZA and MPAM supports use case where
>>>>> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
>>>>> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
>>>>> This may be because I am not familiar with all the requirements here so please do
>>>>> help with insight on how the hardware feature is intended to be used as it relates
>>>>> to its design.
>>>>>
>>>>> We have to be very careful when constraining a feature this much. If resctrl does something
>>>>> like this it essentially restricts what users could do forever.
>>>>
>>>> Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
>>>> fixed kernel CLOSID/RMID configuration option might just give all we need for
>>>> usecases we know we have and be minimally intrusive enough to not preclude a
>>>> more featureful PLZA later when new usecases come about.
>>>
>>> Having ability to grow features would be ideal. I do not see how a fixed kernel CLOSID/RMID
>>> configuration leaves room to build on top though. Could you please elaborate?
>>
>> If we initially go with a single new configuration file, e.g. kernel_mode, which
>> could be "match_user" or "use_root", this would be the only initial change to the
>> interface needed. If more usecases present themselves a new mode could be added,
>> e.g. "configurable", and an interface to actually change the rmid/closid for the
>> kernel could be added.
>
> Something like this could be a base to work from. I think only the two ("match_user" and
> "use_root") are a bit limiting for even the initial implementation though.
> As I understand, "use_root" implies using the allocations of the default group but
> does not indicate what MON group (which RMID/PMG) should be used to monitor the
> work done in kernel space. A way to specify the actual group may be needed?
Yeah, I'm not sure that flexibility is strictly necessary but will make
the interface easier to use.
>
>>> I wonder if the benefit of the fixed CLOSID/RMID is perhaps mostly in the cost of
>>> context switching which I do not think is a concern for MPAM but it may be for PLZA?
>>>
>>> One option to support fixed kernel CLOSID/RMID at the beginning and leave room to build
>>> may be to create the kernel_group or "tasks_kernel" interface as a baseline but in first
>>> implementation only allow user space to write the same group to all "kernel_group" files or
>>> to only allow to write to one of the "tasks_kernel" files in the resctrl fs hierarchy. At
>>> that time the associated CLOSID/RMID would become the "fixed configuration" and attempts to
>>> write to others can return "ENOSPC"?
>>
>> I think we'd have to be sure of the final interface if we go this way.
>
> I do not think we should aim to know the final interface since that requires knowing all future
> hardware features and their implementations in advance. Instead we should aim to have something
> that we can build on that is accompanied by documentation that supports future flexibility (some may
> refer to this as "weasel words").
Makes sense.
>
>>> From what I can tell this still does not require to take away a CLOSID/RMID from user space
>>> though. Dedicating a CLOSID/RMID to kernel work can still be done but be in control of user
>>> that can, for example leave the "tasks" and "cpus" files empty.
>>>
>>>> One complication with the fixed kernel CLOSID/RMID option is that for x86 you
>>>> may want to be able to monitor a tasks resource usage whether or not it is in
>>>> the kernel or userspace and so only have a fixed CLOSID. However, for MPAM this
>>>> wouldn't work as PMG (~RMID) is scoped to PARTID (~CLOSID).
>>>>
>>>>>
>>>>>> userspace software doesn't need to change. This could either take away a
>>>>>> closid/rmid from userspace and dedicate it to the kernel or perhaps have a
>>>>>> policy to have the default group as the kernel group. If you use the default
>>>>>
>>>>> Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
>>>>> between user space and kernel. I do not see a motivation for resctrl to place such
>>>>> constraint.
>>>>>
>>>>>> configuration, at least for MPAM, the kernel may not be running at the highest
>>>>>> priority as a minimum bandwidth can be used to give a priority boost. (Once we
>>>>>> have a resctrl schema for this.)
>>>>>>
>>>>>> It could be useful to have something a bit more featureful though. Is there a
>>>>>> need for the two mappings, task->cpl0 config and task->cpl1 to be independent or
>>>>>> would a task->(cpl0 config, cpl1 config) mapping be sufficient? It seems awkward that
>>>>>> it's not a single write to move a task. If a single mapping is sufficient, then
>>>>>
>>>>> Moving a task on x86 currently takes two writes, as the CLOSID and RMID are written separately.
>>>>> I think the MPAM approach is better and there may be opportunity to do this in a similar
>>>>> way and both architectures use the same field(s) in the task_struct.
>>>>
>>>> I was referring to the userspace file write but unifying on the same fields in
>>>> task_struct could be good. The single write is necessary for MPAM as PMG is
>>>> scoped to PARTID and I don't think x86 behaviour changes if it moves to the same
>>>> approach.
>>>>
>>>
>>> ah - I misunderstood. You are suggesting to have one file that the user writes to
>>> to set both user space and kernel space CLOSID/RMID? This sounds like what the
>>
>> Yes, the kernel_groups idea does partially have this as once you've set the
>> kernel_group for a CTRL_MON or MON group then the user space configuration
>> dictates the kernel space configuration. As you pointed out, this is also
>> a drawback of the kernel_groups idea.
>>
>>> existing "tasks" file does but only supports the same CLOSID/RMID for both user
>>> space and kernel space. To support the new hardware features where the CLOSID/RMID
>>> can be different we cannot just change "tasks" interface and would need to keep it
>>> backward compatible. So far I assumed that it would be ok for the "tasks" file
>>> to essentially get new meaning as the CLOSID/RMID for just user space work, which
>>> seems to require a second file for kernel space as a consequence? So far I have
>>> not seen an option that does not change meaning of the "tasks" file.
>>
>> Would it make sense to have some new type of entries in the tasks file,
>> e.g. k_ctrl_<pid>, k_mon_<pid> to say, in the kernel, use the closid of this
>> CTRL_MON for this task pid or use the rmid of this CTRL_MON/MON group for this task
>> pid? We would still probably need separate files for the cpu configuration.
>
> I am obligated to nack such a change to the tasks file since it would impact any
> existing user space parsing of this file.
>
Good to know. Do you consider the format of the tasks file fully fixed?
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-18 9:35 ` Ben Horgan
@ 2026-02-19 10:27 ` Ben Horgan
0 siblings, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-19 10:27 UTC (permalink / raw)
To: Stephane Eranian
Cc: Moger, Babu, Reinette Chatre, Moger, Babu, Luck, Tony,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Stephane,
On 2/18/26 09:35, Ben Horgan wrote:
> Hi Stephane,
>
> On 2/18/26 06:22, Stephane Eranian wrote:
>> On Tue, Feb 17, 2026 at 7:56 AM Ben Horgan <ben.horgan@arm.com> wrote:
>>>
>>> Hi Babu,
>>>
>>> On 2/16/26 22:52, Moger, Babu wrote:
>>>> Hi Ben,
>>>>
>>>> On 2/16/2026 9:41 AM, Ben Horgan wrote:
>>>>> Hi Babu, Reinette,
>>>>>
>>>>> On 2/14/26 00:10, Reinette Chatre wrote:
>>>>>> Hi Babu,
>>>>>>
>>>>>> On 2/13/26 8:37 AM, Moger, Babu wrote:
>>>>>>> Hi Reinette,
>>>>>>>
>>>>>>> On 2/10/2026 10:17 AM, Reinette Chatre wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>>>>>> Babu,
>>>>>>>>>>>
>>>>>>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>>>>>>
>>>>>>>>>>> Some useful additions to your explanation.
>>>>>>>>>>>
>>>>>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>>>>>
>>>>>>>>>> Yes. Correct.
>>>>>>>>
>>>>>>>> Why limit it to one CTRL_MON group and why not support it for MON
>>>>>>>> groups?
>>>>>>>
>>>>>>> There can be only one PLZA configuration in a system. The values in
>>>>>>> the MSR_IA32_PQR_PLZA_ASSOC register (RMID, RMID_EN, CLOSID,
>>>>>>> CLOSID_EN) must be identical across all logical processors. The only
>>>>>>> field that may differ is PLZA_EN.
>>>>>
>>>>> Does this have any effect on hypervisors?
>>>>
>>>> Because the hypervisor runs at CPL0, there could be some use case. I have
>>>> not completely understood that part.
>>>>
>>>>>
>>>>>>
>>>>>> ah - this is a significant part that I missed. Since this is a per-
>>>>>> CPU register it seems
>>>>>
>>>>> I also missed that.
>>>>>
>>>>>> to have the ability for expanded use in the future where different
>>>>>> CLOSID and RMID may be
>>>>>> written to it? Is PLZA leaving room for such future enhancement or
>>>>>> does the spec contain
>>>>>> the text that state "The values in the MSR_IA32_PQR_PLZA_ASSOC
>>>>>> register (RMID, RMID_EN,
>>>>>> CLOSID, CLOSID_EN) must be identical across all logical processors."?
>>>>>> That is, "forever
>>>>>> and always"?
>>>>>>
>>>>>> If I understand correctly MPAM could have different PARTID and PMG
>>>>>> for kernel use so we
>>>>>> need to consider these different architectural behaviors.
>>>>>
>>>>> Yes, MPAM has a per-cpu register MPAM1_EL1.
>>>>>
>>>>
>>>> oh ok.
>>>>
>>>>>>
>>>>>>> I was initially unsure which RMID should be used when PLZA is
>>>>>>> enabled on MON groups.
>>>>>>>
>>>>>>> After re-evaluating, enabling PLZA on MON groups is still feasible:
>>>>>>>
>>>>>>> 1. Only one group in the system can have PLZA enabled.
>>>>>>> 2. If PLZA is enabled on CTRL_MON group then we cannot enable PLZA
>>>>>>> on MON group.
>>>>>>> 3. If PLZA is enabled on the CTRL_MON group, then the CLOSID and
>>>>>>> RMID of the CTRL_MON group can be written.
>>>>>>> 4. If PLZA is enabled on a MON group, then the CLOSID of the
>>>>>>> CTRL_MON group can be used, while the RMID of the MON group can be
>>>>>>> written.
>>>>>
>>>>> Given that CLOSID and RMID are fixed once in the PLZA configuration
>>>>> could this be simplified by just assuming they have the values of the
>>>>> default group, CLOSID=0 and RMID=0 and let the user base their
>>>>> configuration on that?
>>>>>
>>>>
>>>> I didn't understand this question. There are 16 CLOSIDs and 1024 RMIDs.
>>>> We can use any one of these to enable PLZA. It is not fixed in that sense.
>>>
>>> Sorry, I wasn't clear. What I'm trying to understand is what you gain by
>>> this flexibility. Given that the CLOSID and RMID values are just
>>> identifiers within the hardware and have only the meaning they are given
>>> by the grouping and controls/monitors set up by resctrl (or any other
>>> software interface) would you lose anything by just saying the PLZA
>>> group has CLOSID=0 and RMID=0? Is there value in changing the PLZA
>>> CLOSID and RMID or can the same effect happen by just changing the
>>> resctrl configuration?
>>>
>> Not quite.
>> When you enter the kernel, you want to run unthrottled to avoid
>> priority inversion situations.
In cases where you want to reserve a cache region for one program you
may want to avoid giving the kernel access to that region so it doesn't
pollute it.
>> But at the same time, you still want to be able to monitor the
>> bandwidth for your thread or job, i.e., keep the same
>> RMID you have in user space.
>
> Thanks for sharing your usecase.
>
>>
>> The kernel is by construction shared by all threads running in the
>> system. It should run unrestricted or with the
>> bandwidth allocated to the highest priority tasks.
>>
>> PLZA should not change the RMID at all.
>
> Would the above with RMID_EN=0 give you this usecase?
>
> Unfortunately, this isn't possible when rmid/pmg is scoped to
> closid/partid as is the case in MPAM, i.e. the monitors require a match
> on the pair (partid, pmg). Hence, I think we need to support the case
> where both RMID and CLOSID change.
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-17 21:44 ` Luck, Tony
2026-02-17 22:37 ` Reinette Chatre
@ 2026-02-19 11:06 ` Ben Horgan
2026-02-19 18:12 ` Luck, Tony
1 sibling, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-19 11:06 UTC (permalink / raw)
To: Luck, Tony, Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Drew Fustini, corbet@lwn.net,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, peterz@infradead.org,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
mgorman@suse.de, vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal, Zeng Heng
+Zeng
Discussion of resctrl configuration when running in the kernel. You
previously commented on something similar in the MPAM series.
On 2/17/26 21:44, Luck, Tony wrote:
>>>>> I'm not sure if this would happen in the real world or not.
>>>>
>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>>>
>>> Indeed. This is all getting a bit complicated.
>>>
>>
>> ack
>
> We have several proposals so far:
Thanks for making the summary.
>
> 1) Ben's suggestion to use the default group (either with a Babu-style
> "plza" file just in that group, or a configuration file under "info/").
>
> This is easily the simplest for implementation, but has no flexibility.
> Also requires users to move all the non-critical workloads out to other
> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
>
> 2) My thoughts are for a separate group that is only used to configure
> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
> are used for all tasks when in kernel mode.
If you mark the group somehow can you avoid a dedicated CLOSID/RMID
pair? (MPAM systems are often sized so you can use a partid per cpu,
num_partids=num_cpus.) The tasks/cpus files in the group could just be
used as normal for userspace configuration.
For 1,2 I think we need to be able to have an option where the rmid is
fixed to the userspace value and one where it isn't.
>
> No context switch overhead. Has some flexibility.
>
> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
> group in addition to belonging to another group that defines schemata
> resources when running in non-kernel mode.
> Tasks aren't required to be in the kernel group, in which case they
> keep the same CLOSID in both user and kernel mode. When used in this
> way there will be context switch overhead when changing between tasks
> with different kernel CLOSID/RMID.
>
> 4) Even more complex scenarios with more than one user configurable
> kernel group to give more options on resources available in the kernel.
If we are going to add more files to resctrl we should perhaps think of
a reserved prefix so they won't conflict with the names of user created
CTRL_MON groups.
>
> I had a quick pass at coding my option "2". My UI to designate the
> group to use for kernel mode is to reserve the name "kernel_group"
I think we need to keep it possible for the kernel group to be the
default group. In MPAM systems the firmware and MPAM hypervisors are
likely to run with partid=0,pmg=0 and we may want the kernel to follow suit.
I need to check on an RDT system but I think the kernel also always runs
the idle thread with the default configuration. (Writing 0 to the tasks
file changes the configuration for the current task rather than the idle
tasks. I think this is missing from the documentation so I'll create a
patch to add it.)
> when making CTRL_MON groups. Some tweaks to avoid creating the
> "tasks", "cpus", and "cpus_list" files (which might be done more
> elegantly), and "mon_groups" directory in this group.
>
> I just have stubs in the arch/x86 core.c file for enumeration and
> enable/disable. Just realized I'm missing a call to disable on
> unmount of the resctrl file system.
>
> Apart from umount, I think it is more or less complete, and fairly
> compact:
>
> arch/x86/kernel/cpu/resctrl/core.c | 25 +++++++++++++++++++++++++
> fs/resctrl/internal.h | 9 +++++++--
> fs/resctrl/rdtgroup.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> include/linux/resctrl.h | 4 ++++
> 4 files changed, 72 insertions(+), 15 deletions(-)
>
> -Tony
>
> ---
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 006e57fd7ca5..540ab9d7621a 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -702,6 +702,10 @@ bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r);
> extern unsigned int resctrl_rmid_realloc_threshold;
> extern unsigned int resctrl_rmid_realloc_limit;
>
> +bool resctrl_arch_kernel_group_is_supported(void);
> +void resctrl_arch_kernel_group_enable(u32 closid, u32 rmid);
> +void resctrl_arch_kernel_group_disable(void);
> +
> int resctrl_init(void);
> void resctrl_exit(void);
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 1a9b29119f88..99fbdcaf3c63 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -156,6 +156,7 @@ extern bool resctrl_mounted;
> enum rdt_group_type {
> RDTCTRL_GROUP = 0,
> RDTMON_GROUP,
> + RDTKERNEL_GROUP,
> RDT_NUM_GROUP,
> };
>
> @@ -245,6 +246,8 @@ struct rdtgroup {
>
> #define RFTYPE_BASE BIT(1)
>
> +#define RFTYPE_TASKS_CPUS BIT(2)
> +
> #define RFTYPE_CTRL BIT(4)
>
> #define RFTYPE_MON BIT(5)
> @@ -267,9 +270,11 @@ struct rdtgroup {
>
> #define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP)
>
> -#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL)
> +#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_TASKS_CPUS | RFTYPE_CTRL)
> +
> +#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_TASKS_CPUS | RFTYPE_MON)
>
> -#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON)
> +#define RFTYPE_KERNEL_BASE (RFTYPE_BASE | RFTYPE_CTRL)
>
> /* List of all resource groups */
> extern struct list_head rdt_all_groups;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 7667cf7c4e94..94d20b200e47 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -733,6 +733,28 @@ static void clear_closid_rmid(int cpu)
> RESCTRL_RESERVED_CLOSID);
> }
>
> +static bool kernel_group_is_enabled;
> +static u32 kernel_group_closid, kernel_group_rmid;
> +
> +bool resctrl_arch_kernel_group_is_supported(void)
> +{
> + return true;
> +}
> +
> +void resctrl_arch_kernel_group_enable(u32 closid, u32 rmid)
> +{
> + pr_info("Enable kernel group on all CPUs here closid=%u rmid=%u\n", closid, rmid);
> + kernel_group_closid = closid;
> + kernel_group_rmid = rmid;
> + kernel_group_is_enabled = true;
> +}
> +
> +void resctrl_arch_kernel_group_disable(void)
> +{
> + pr_info("Disable kernel group on all CPUs here\n");
> + kernel_group_is_enabled = false;
> +}
> +
> static int resctrl_arch_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
> @@ -743,6 +765,9 @@ static int resctrl_arch_online_cpu(unsigned int cpu)
> mutex_unlock(&domain_list_lock);
>
> clear_closid_rmid(cpu);
> + if (kernel_group_is_enabled)
> + pr_info("Enable kernel group on CPU:%d closid=%u rmid=%u\n",
> + cpu, kernel_group_closid, kernel_group_rmid);
> resctrl_online_cpu(cpu);
>
> return 0;
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index ba8d503551cd..0d396569a76a 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -2046,7 +2046,7 @@ static struct rftype res_common_files[] = {
> .kf_ops = &rdtgroup_kf_single_ops,
> .write = rdtgroup_cpus_write,
> .seq_show = rdtgroup_cpus_show,
> - .fflags = RFTYPE_BASE,
> + .fflags = RFTYPE_BASE | RFTYPE_TASKS_CPUS,
> },
> {
> .name = "cpus_list",
> @@ -2055,7 +2055,7 @@ static struct rftype res_common_files[] = {
> .write = rdtgroup_cpus_write,
> .seq_show = rdtgroup_cpus_show,
> .flags = RFTYPE_FLAGS_CPUS_LIST,
> - .fflags = RFTYPE_BASE,
> + .fflags = RFTYPE_BASE | RFTYPE_TASKS_CPUS,
> },
> {
> .name = "tasks",
> @@ -2063,14 +2063,14 @@ static struct rftype res_common_files[] = {
> .kf_ops = &rdtgroup_kf_single_ops,
> .write = rdtgroup_tasks_write,
> .seq_show = rdtgroup_tasks_show,
> - .fflags = RFTYPE_BASE,
> + .fflags = RFTYPE_BASE | RFTYPE_TASKS_CPUS,
> },
> {
> .name = "mon_hw_id",
> .mode = 0444,
> .kf_ops = &rdtgroup_kf_single_ops,
> .seq_show = rdtgroup_rmid_show,
> - .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG,
> + .fflags = RFTYPE_BASE | RFTYPE_MON | RFTYPE_DEBUG,
> },
> {
> .name = "schemata",
> @@ -2078,7 +2078,7 @@ static struct rftype res_common_files[] = {
> .kf_ops = &rdtgroup_kf_single_ops,
> .write = rdtgroup_schemata_write,
> .seq_show = rdtgroup_schemata_show,
> - .fflags = RFTYPE_CTRL_BASE,
> + .fflags = RFTYPE_BASE | RFTYPE_CTRL,
> },
> {
> .name = "mba_MBps_event",
> @@ -2093,14 +2093,14 @@ static struct rftype res_common_files[] = {
> .kf_ops = &rdtgroup_kf_single_ops,
> .write = rdtgroup_mode_write,
> .seq_show = rdtgroup_mode_show,
> - .fflags = RFTYPE_CTRL_BASE,
> + .fflags = RFTYPE_BASE | RFTYPE_CTRL,
> },
> {
> .name = "size",
> .mode = 0444,
> .kf_ops = &rdtgroup_kf_single_ops,
> .seq_show = rdtgroup_size_show,
> - .fflags = RFTYPE_CTRL_BASE,
> + .fflags = RFTYPE_BASE | RFTYPE_CTRL,
> },
> {
> .name = "sparse_masks",
> @@ -2114,7 +2114,7 @@ static struct rftype res_common_files[] = {
> .mode = 0444,
> .kf_ops = &rdtgroup_kf_single_ops,
> .seq_show = rdtgroup_closid_show,
> - .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG,
> + .fflags = RFTYPE_BASE | RFTYPE_CTRL | RFTYPE_DEBUG,
> },
> };
>
> @@ -3788,11 +3788,15 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
> }
>
> if (rtype == RDTCTRL_GROUP) {
> - files = RFTYPE_BASE | RFTYPE_CTRL;
> + files = RFTYPE_CTRL_BASE;
> + if (resctrl_arch_mon_capable())
> + files |= RFTYPE_MON_BASE;
> + } else if (rtype == RDTKERNEL_GROUP) {
> + files = RFTYPE_KERNEL_BASE;
> if (resctrl_arch_mon_capable())
> files |= RFTYPE_MON;
> } else {
> - files = RFTYPE_BASE | RFTYPE_MON;
> + files = RFTYPE_MON_BASE;
> }
>
> ret = rdtgroup_add_files(kn, files);
> @@ -3866,12 +3870,21 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn,
> static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
> const char *name, umode_t mode)
> {
> + enum rdt_group_type rtype = RDTCTRL_GROUP;
> struct rdtgroup *rdtgrp;
> struct kernfs_node *kn;
> u32 closid;
> int ret;
>
> - ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp);
> + if (!strcmp(name, "kernel_group")) {
> + if (!resctrl_arch_kernel_group_is_supported()) {
> + rdt_last_cmd_puts("No support for kernel group\n");
> + return -EINVAL;
> + }
> + rtype = RDTKERNEL_GROUP;
> + }
> +
> + ret = mkdir_rdt_prepare(parent_kn, name, mode, rtype, &rdtgrp);
> if (ret)
> return ret;
>
> @@ -3898,7 +3911,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>
> list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
>
> - if (resctrl_arch_mon_capable()) {
> + if (rtype == RDTCTRL_GROUP && resctrl_arch_mon_capable()) {
> /*
> * Create an empty mon_groups directory to hold the subset
> * of tasks and cpus to monitor.
> @@ -3912,6 +3925,9 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
> rdtgrp->mba_mbps_event = mba_mbps_default_event;
> }
>
> + if (rtype == RDTKERNEL_GROUP)
> + resctrl_arch_kernel_group_enable(rdtgrp->closid, rdtgrp->mon.rmid);
> +
> goto out_unlock;
>
> out_del_list:
> @@ -4005,6 +4021,11 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> u32 closid, rmid;
> int cpu;
>
> + if (rdtgrp->type == RDTKERNEL_GROUP) {
> + resctrl_arch_kernel_group_disable();
> + goto skip_tasks_and_cpus;
> + }
> +
> /* Give any tasks back to the default group */
> rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
>
> @@ -4025,6 +4046,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
> update_closid_rmid(tmpmask, NULL);
>
> +skip_tasks_and_cpus:
> rdtgroup_unassign_cntrs(rdtgrp);
>
> free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
> @@ -4073,7 +4095,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
> * If the rdtgroup is a mon group and parent directory
> * is a valid "mon_groups" directory, remove the mon group.
> */
> - if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn &&
> + if ((rdtgrp->type == RDTCTRL_GROUP || rdtgrp->type == RDTKERNEL_GROUP) &&
> + parent_kn == rdtgroup_default.kn &&
> rdtgrp != &rdtgroup_default) {
> if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
> rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-18 16:44 ` Luck, Tony
@ 2026-02-19 17:03 ` Luck, Tony
2026-02-19 17:45 ` Ben Horgan
2026-02-20 8:21 ` Drew Fustini
2026-02-19 17:33 ` Ben Horgan
2026-02-20 2:53 ` Reinette Chatre
2 siblings, 2 replies; 114+ messages in thread
From: Luck, Tony @ 2026-02-19 17:03 UTC (permalink / raw)
To: Reinette Chatre
Cc: Ben Horgan, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
> Likely real implementation:
>
> Sub-components of each of the ideas above are encoded as a bitmask that
> is written to plza_mode. There is a file in the info/ directory listing
> which bits are supported on the current system (e.g. the "keep the same
> RMID" mode may be impractical on ARM, so it would not be listed as an
> option.)
In x86 terms where control and monitor functions are independent we
have:
Control:
1) Use default (CLOSID==0) for kernel
2) Allocate just one CLOSID for kernel
3) Allocate many CLOSIDs for kernel
Monitor:
1) Do not monitor kernel separately from user
2) Use default (RMID==0) for kernel
3) Allocate one RMID for kernel
4) Allocate many RMIDs for kernel
What options are possible on ARM & RISC-V?
-Tony
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-18 16:44 ` Luck, Tony
2026-02-19 17:03 ` Luck, Tony
@ 2026-02-19 17:33 ` Ben Horgan
2026-02-20 2:53 ` Reinette Chatre
2 siblings, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-19 17:33 UTC (permalink / raw)
To: Luck, Tony, Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Drew Fustini, corbet@lwn.net,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, peterz@infradead.org,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
mgorman@suse.de, vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Tony,
On 2/18/26 16:44, Luck, Tony wrote:
> On Tue, Feb 17, 2026 at 03:55:44PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 2/17/26 2:52 PM, Luck, Tony wrote:
>>> On Tue, Feb 17, 2026 at 02:37:49PM -0800, Reinette Chatre wrote:
>>>> Hi Tony,
>>>>
>>>> On 2/17/26 1:44 PM, Luck, Tony wrote:
>>>>>>>>> I'm not sure if this would happen in the real world or not.
>>>>>>>>
>>>>>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
>>>>>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>>>>>>>
>>>>>>> Indeed. This is all getting a bit complicated.
>>>>>>>
>>>>>>
>>>>>> ack
>>>>>
>>>>> We have several proposals so far:
>>>>>
>>>>> 1) Ben's suggestion to use the default group (either with a Babu-style
>>>>> "plza" file just in that group, or a configuration file under "info/").
>>>>>
>>>>> This is easily the simplest for implementation, but has no flexibility.
>>>>> Also requires users to move all the non-critical workloads out to other
>>>>> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
>>>>>
>>>>> 2) My thoughts are for a separate group that is only used to configure
>>>>> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
>>>>> are used for all tasks when in kernel mode.
>>>>>
>>>>> No context switch overhead. Has some flexibility.
>>>>>
>>>>> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
>>>>> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
>>>>> group in addition to belonging to another group that defines schemata
>>>>> resources when running in non-kernel mode.
>>>>> Tasks aren't required to be in the kernel group, in which case they
>>>>> keep the same CLOSID in both user and kernel mode. When used in this
>>>>> way there will be context switch overhead when changing between tasks
>>>>> with different kernel CLOSID/RMID.
>>>>>
>>>>> 4) Even more complex scenarios with more than one user configurable
>>>>> kernel group to give more options on resources available in the kernel.
>>>>>
>>>>>
>>>>> I had a quick pass at coding my option "2". My UI to designate the
>>>>> group to use for kernel mode is to reserve the name "kernel_group"
>>>>> when making CTRL_MON groups. Some tweaks to avoid creating the
>>>>> "tasks", "cpus", and "cpus_list" files (which might be done more
>>>>> elegantly), and "mon_groups" directory in this group.
>>>>
>>>> Should the decision of whether context switch overhead is acceptable
>>>> not be left up to the user?
>>>
>>> When someone comes up with a convincing use case to support one set of
>>> kernel resources when interrupting task A, and a different set of
>>> resources when interrupting task B, we should certainly listen.
>>
>> Absolutely. Someone can come up with such a use case at any time though. This
>> could be, and as has happened with some other resctrl interfaces, likely will be
>> after this feature has been supported for a few kernel versions. What timeline
>> should we give which users to share their use cases with us? Even if we do hear
>> from some users will that guarantee that no such use case will arise in the
>> future? Such predictions of usage are difficult for me and I thus find it simpler
>> to think of flexible ways to enable the features that we know the hardware supports.
>>
>> This does not mean that a full featured solution needs to be implemented from day 1.
>> If folks believe there are "no valid use cases" today resctrl still needs to prepare for
>> how it can grow to support full hardware capability and hardware designs in the
>> future.
>>
>> Also, please also consider not just resources for kernel work but also monitoring for
>> kernel work. I do think, for example, a reasonable use case may be to determine
>> how much memory bandwidth the kernel uses on behalf of certain tasks.
>>
>>>> I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
>>>> the needed registers will only be updated if there is a new CLOSID/RMID needed
>>>> for kernel space.
>>>
>>> Babu's RFC does this.
>>
>> Right.
>>
>>>
>>>> Are you suggesting that just this checking itself is too
>>>> expensive to justify giving user space more flexibility by fully enabling what
>>>> the hardware supports? If resctrl does draw such a line to not enable what
>>>> hardware supports it should be well justified.
>>>
>>> The check is likely lightweight (as long as the variables to be
>>> compared reside in the same cache lines as the existing CLOSID
>>> and RMID checks). So if there is a use case for different resources
>>> when in kernel mode, then taking this path will be fine.
>>
>> Why limit this to knowing about a use case? As I understand this feature can be
>> supported in a flexible way without introducing additional context switch overhead
>> if the user prefers to use just one allocation for all kernel work. By being
>> configurable and allowing resctrl to support more use cases in the future resctrl
>> does not paint itself into a corner. This allows resctrl to grow support so that
>> the user can use all capabilities of the hardware with understanding that it will
>> increase context switch time.
>>
>> Reinette
>
> How about this idea for extensibility.
>
> Rename Babu's "plza" file to "plza_mode". Instead of just being an
> on/off switch, it may accept multiple possible requests.
If we're making global configuration choices then I think it should be
visible in a global location. It doesn't seem good to have to check all
CTRL_MON groups.
>
> Humorous version:
>
> # echo "babu" > plza_mode
>
> This results in the behavior of Babu's RFC. The CLOSID and RMID assigned
> to the CTRL_MON group are used when in kernel mode, but only for tasks
> that have their task-id written to the "tasks" file, or for tasks in the
> default group running on CPUs assigned to this group via the "cpus" or
> "cpus_list" files.
>
> # echo "tony" > plza_mode
>
> All tasks run with the CLOSID/RMID for this group. The "tasks", "cpus" and
> "cpus_list" files and the "mon_groups" directory are removed.
>
> # echo "ben" > plza_mode
>
> Only usable in the top-level default CTRL_MON directory. CLOSID=0/RMID=0
> are used for all tasks in kernel mode.
>
> # echo "stephane" > plza_mode
>
> The RMID for this group is freed. All tasks run in kernel mode with the
> CLOSID for this group, but use the same RMID for both user and kernel.
> In addition to files removed in "tony" mode, the mon_data directory is
> removed.
For these options, with a single group set as the plza group, we could
have a global option and then just a plza marker.
>
> # echo "some-future-name" > plza_mode
>
> Somebody has a new use case. Resctrl can be extended by allowing some
> new mode.
>
> Likely real implementation:
>
> Sub-components of each of the ideas above are encoded as a bitmask that
> is written to plza_mode. There is a file in the info/ directory listing
> which bits are supported on the current system (e.g. the "keep the same
> RMID" mode may be impractical on ARM, so it would not be listed as an
> option.)
>
> -Tony
Thanks,
Ben
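Tony's "likely real implementation" quoted above (a bitmask written to
plza_mode, with an info/ file advertising the supported bits) could be
sketched roughly as below. All bit names, values, and the parsing helper
are hypothetical illustrations, not a proposed ABI.

```python
# Sketch of a bitmask-encoded "plza_mode". Every name and bit position
# here is invented for illustration only.
PLZA_PER_GROUP_TASKS = 1 << 0  # per-task opt-in (Babu-style)
PLZA_ALL_TASKS       = 1 << 1  # all tasks use this group's CLOSID/RMID in the kernel
PLZA_DEFAULT_IDS     = 1 << 2  # force CLOSID=0/RMID=0 for kernel mode
PLZA_KEEP_USER_RMID  = 1 << 3  # keep the user RMID while in the kernel

def parse_plza_mode(value: str, supported: int) -> int:
    """Reject writes that request bits the info/ file does not advertise."""
    bits = int(value, 0)
    if bits & ~supported:
        raise ValueError("unsupported plza_mode bits: %#x" % (bits & ~supported))
    return bits

# An MPAM system might omit PLZA_KEEP_USER_RMID, since a monitor there
# is identified by the (partid, pmg) pair.
supported = PLZA_PER_GROUP_TASKS | PLZA_ALL_TASKS | PLZA_DEFAULT_IDS
print(hex(parse_plza_mode("0x3", supported)))  # prints 0x3
```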
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-19 17:03 ` Luck, Tony
@ 2026-02-19 17:45 ` Ben Horgan
2026-02-20 8:21 ` Drew Fustini
1 sibling, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-19 17:45 UTC (permalink / raw)
To: Luck, Tony, Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Drew Fustini, corbet@lwn.net,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, peterz@infradead.org,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
mgorman@suse.de, vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Tony,
On 2/19/26 17:03, Luck, Tony wrote:
>> Likely real implementation:
>>
>> Sub-components of each of the ideas above are encoded as a bitmask that
>> is written to plza_mode. There is a file in the info/ directory listing
>> which bits are supported on the current system (e.g. the "keep the same
>> RMID" mode may be impractical on ARM, so it would not be listed as an
>> option.)
>
>
> In x86 terms where control and monitor functions are independent we
> have:
>
> Control:
> 1) Use default (CLOSID==0) for kernel
> 2) Allocate just one CLOSID for kernel
> 3) Allocate many CLOSIDs for kernel
>
> Monitor:
> 1) Do not monitor kernel separately from user
> 2) Use default (RMID==0) for kernel
> 3) Allocate one RMID for kernel
> 4) Allocate many RMIDs for kernel
>
> What options are possible on ARM & RISC-V?
For ARM (MPAM) we have the same flexibility for the kernel as we have
for userspace. At EL0 (userspace) the configuration of partid/pmg is in
SYS_MPAM0_EL1 and at EL1 (kernel) it is in SYS_MPAM1_EL1. These are both
per-cpu system registers and control the partid and pmg the cpu adds to
its requests at the particular exception level (EL).
Of the above we can do all the control options (1, 2 and 3) and all the
monitor options except 1 (so 2, 3 and 4), with the caveat that if more
than one partid/closid is used for the kernel then at least that number
of pmg/monitors are used. This is because the monitors are not
independent (as you say), i.e. a monitor is identified by the pair
(partid, pmg).
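As a rough illustration of the EL0/EL1 split described here, the sketch
below packs a partid/pmg pair into an MPAM0_EL1/MPAM1_EL1-style value.
The field offsets follow my reading of the Arm MPAM architecture; treat
the helper name and the layout as assumptions, not the kernel's actual
accessors.

```python
# Illustrative MPAMn_ELx field packing (offsets assumed from the MPAM
# architecture: PARTID_I[15:0], PARTID_D[31:16], PMG_I[39:32], PMG_D[47:40]).
PARTID_I_SHIFT = 0    # instruction-side PARTID
PARTID_D_SHIFT = 16   # data-side PARTID
PMG_I_SHIFT    = 32
PMG_D_SHIFT    = 40

def mpam_elx(partid: int, pmg: int) -> int:
    # Use the same partid/pmg for instruction and data requests.
    return (partid << PARTID_I_SHIFT | partid << PARTID_D_SHIFT |
            pmg << PMG_I_SHIFT | pmg << PMG_D_SHIFT)

# Userspace (EL0) and kernel (EL1) can be programmed independently:
user_val   = mpam_elx(partid=5, pmg=1)   # would go to SYS_MPAM0_EL1
kernel_val = mpam_elx(partid=2, pmg=0)   # would go to SYS_MPAM1_EL1
print(hex(user_val), hex(kernel_val))
```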
>
> -Tony
Thanks,
Ben
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-19 11:06 ` Ben Horgan
@ 2026-02-19 18:12 ` Luck, Tony
2026-02-19 18:36 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Luck, Tony @ 2026-02-19 18:12 UTC (permalink / raw)
To: Ben Horgan
Cc: Reinette Chatre, Moger, Babu, Moger, Babu, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal, Zeng Heng
On Thu, Feb 19, 2026 at 11:06:14AM +0000, Ben Horgan wrote:
> If we are going to add more files to resctrl we should perhaps think of
> a reserved prefix so they won't conflict with the names of user created
> CTRL_MON groups.
Good idea. Since these new files/directories are associated with some
resctrl internals, perhaps the reserved prefix could be "resctrl_".
Plausibly someone might already be naming their CTRL_MON groups with
this prefix, but I'd hope that the redundancy in full pathnames like:
/sys/fs/resctrl/resctrl_xxx
would have deterred them from such a choice.
But I'm generally bad at naming things, so other suggestions welcome
(as long as this doesn't turn into a protracted bike-shedding event :-)
-Tony
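A mkdir-time check for such a reserved prefix might look like the
following sketch; the "resctrl_" prefix and the helper name are
illustrative only.

```python
# Sketch: reject user-created group names that collide with a reserved
# prefix for resctrl-internal files/directories (prefix is hypothetical).
RESERVED_PREFIX = "resctrl_"

def group_name_allowed(name: str) -> bool:
    # Would be checked in the mkdir path before creating a CTRL_MON group.
    return not name.startswith(RESERVED_PREFIX)

print(group_name_allowed("high_prio"))    # True
print(group_name_allowed("resctrl_xxx"))  # False
```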
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-19 10:21 ` Ben Horgan
@ 2026-02-19 18:14 ` Reinette Chatre
2026-02-23 9:48 ` Ben Horgan
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-19 18:14 UTC (permalink / raw)
To: Ben Horgan
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Ben,
On 2/19/26 2:21 AM, Ben Horgan wrote:
> On 2/17/26 18:51, Reinette Chatre wrote:
>> On 2/16/26 7:18 AM, Ben Horgan wrote:
>>> On Thu, Feb 12, 2026 at 10:37:21AM -0800, Reinette Chatre wrote:
>>>> On 2/12/26 5:55 AM, Ben Horgan wrote:
>>>>> On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
>>>>>> On 2/11/26 8:40 AM, Ben Horgan wrote:
>>>>>>> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>>
>>>>>>>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>>>>>>>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>>>>>>>> instead of CPL0 using something like "kernel" or ... ?
>>>>>>>
>>>>>>> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
>>>>>>> internally and here are a few thoughts.
>>>>>>>
>>>>>>> If the use case is just an option to run all tasks with the same closid/rmid
>>>>>>> (partid/pmg) configuration when they are running in the kernel then I'd favour a
>>>>>>> mount option. The resctrl filesystem interface doesn't need to change and
>>>>>>
>>>>>> I view mount options as an interface of last resort. Why would a mount option be needed
>>>>>> in this case? The existence of the file used to configure the feature seems sufficient?
>>>>>
>>>>> If we are taking away a closid from the user then the number of CTRL_MON groups
>>>>> that can be created changes. It seems reasonable for user-space to expect
>>>>> num_closid to be a fixed value.
>>>>
>>>> I do not see why we need to take away a CLOSID from the user. Consider a user space that
>>>
>>> Yes, it is just slightly simpler to take away a CLOSID, but we could
>>> also go with the default CLOSID being used for the kernel. I would be
>>> ok with a file stating the mode, like the mbm_event file does for
>>> counter assignment. It is slightly misleading that a configuration
>>> file is under info but necessary as we don't have another location
>>> global to the resctrl mount.
>>
>> Indeed, the "info" directory has evolved more into a "config" directory.
>>
>>>> runs with just two resource groups, for example, "high priority" and "low priority", it seems
>>>> reasonable to make it possible to let the "low priority" tasks run with "high priority"
>>>> allocations when in kernel space without needing to dedicate a new CLOSID? More reasonable
>>>> when only considering memory bandwidth allocation though.
>>>>
>>>>>
>>>>>>
>>>>>> Also ...
>>>>>>
>>>>>> I do not think resctrl should unnecessarily place constraints on what the hardware
>>>>>> features are capable of. As I understand, both PLZA and MPAM supports use case where
>>>>>> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
>>>>>> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
>>>>>> This may be because I am not familiar with all the requirements here so please do
>>>>>> help with insight on how the hardware feature is intended to be used as it relates
>>>>>> to its design.
>>>>>>
>>>>>> We have to be very careful when constraining a feature this much. If resctrl does something
>>>>>> like this it essentially restricts what users could do forever.
>>>>>
>>>>> Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
>>>>> fixed kernel CLOSID/RMID configuration option might just give all we need for
>>>>> use cases we know we have and be minimally intrusive enough to not preclude a
>>>>> more featureful PLZA later when new use cases come about.
>>>>
>>>> Having ability to grow features would be ideal. I do not see how a fixed kernel CLOSID/RMID
>>>> configuration leaves room to build on top though. Could you please elaborate?
>>>
>>> If we initially go with a single new configuration file, e.g. kernel_mode, which
>>> could be "match_user" or "use_root", this would be the only initial change to the
>>> interface needed. If more use cases present themselves a new mode could be added,
>>> e.g. "configurable", and an interface to actually change the rmid/closid for the
>>> kernel could be added.
>>
>> Something like this could be a base to work from. I think only the two ("match_user" and
>> "use_root") are a bit limiting for even the initial implementation though.
>> As I understand, "use_root" implies using the allocations of the default group but
>> does not indicate what MON group (which RMID/PMG) should be used to monitor the
>> work done in kernel space. A way to specify the actual group may be needed?
>
> Yeah, I'm not sure that flexibility is strictly necessary but will make
> the interface easier to use.
I find your proposal to be a good foundation to build on. I am in process of trying out
some ideas around it for consideration and comparison to other ideas.
...
>>>> existing "tasks" file does but only supports the same CLOSID/RMID for both user
>>>> space and kernel space. To support the new hardware features where the CLOSID/RMID
>>>> can be different we cannot just change "tasks" interface and would need to keep it
>>>> backward compatible. So far I assumed that it would be ok for the "tasks" file
>>>> to essentially get new meaning as the CLOSID/RMID for just user space work, which
>>>> seems to require a second file for kernel space as a consequence? So far I have
>>>> not seen an option that does not change meaning of the "tasks" file.
>>>
>>> Would it make sense to have some new type of entries in the tasks file,
>>> e.g. k_ctrl_<pid>, k_mon_<pid> to say, in the kernel, use the closid of this
>>> CTRL_MON for this task pid or use the rmid of this CTRL_MON/MON group for this task
>>> pid? We would still probably need separate files for the cpu configuration.
>>
>> I am obligated to nack such a change to the tasks file since it would impact any
>> existing user space parsing of this file.
>>
>
> Good to know. Do you consider the format of the tasks file fully fixed?
At this point I believe it is fully fixed, yes. For this we need to consider both
how it is documented to be used and how it is used. For the former we of course have
Documentation/filesystems/resctrl.rst but for the latter it becomes difficult.
On the documentation side I also find existing documentation to be specific in how
"tasks" file should be interpreted: "Reading this file shows the list of all tasks
that belong to this group.". I do not find there to be a lot of room for changing
interpretation here.
An interface change as you suggest is reasonable for a file that is consumed by a
human - somebody can read the file and immediately notice the change and it may even
be intuitive. We know that there is a lot of tooling built around resctrl fs though
so we should evaluate the impact of any interface changes on such automation. Not all
of this tooling is public, which makes the impact difficult to predict, so
we tend to be conservative in our assumptions here.
There is one open source resctrl fs tool, the "pqos" utility [1], that is getting a lot of
usage and it could be a predictor (albeit not decider) of such interface change impact.
A peek at how it parses the "tasks" file confirms that it only expects a number, see
resctrl_alloc_task_read() at https://github.com/intel/intel-cmt-cat/blob/master/lib/resctrl_alloc.c#L437
I thus expect that a user running pqos on a kernel that contains such a change to the
"tasks" file will see it fail, which confirms that changing the syntax of the
"tasks" file should be avoided.
Reinette
[1] https://github.com/intel/intel-cmt-cat
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-19 18:12 ` Luck, Tony
@ 2026-02-19 18:36 ` Reinette Chatre
0 siblings, 0 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-19 18:36 UTC (permalink / raw)
To: Luck, Tony, Ben Horgan
Cc: Moger, Babu, Moger, Babu, Drew Fustini, corbet@lwn.net,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, peterz@infradead.org,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
mgorman@suse.de, vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal, Zeng Heng
On 2/19/26 10:12 AM, Luck, Tony wrote:
> On Thu, Feb 19, 2026 at 11:06:14AM +0000, Ben Horgan wrote:
>> If we are going to add more files to resctrl we should perhaps think of
>> a reserved prefix so they won't conflict with the names of user created
>> CTRL_MON groups.
>
> Good idea. Since these new files/directories are associated with some
> resctrl internals, perhaps the reserved prefix could be "resctrl_".
>
> Plausibly someone might be naming their CTRL_MON groups with this
> prefix, I'd hope that the redundancy in full pathnames like:
>
> /sys/fs/resctrl/resctrl_xxx
>
> would have deterred them from such a choice.
>
> But I'm generally bad at naming things, so other suggestions welcome
> (as long as this doesn't turn into a protracted bike-shedding event :-)
I do not see how we can invent a new prefix after the fact. We cannot
dictate what version of tooling user space will run and we cannot just
retroactively go change the documentation that has been in the kernel for
many versions. Looking at the documentation, what a user expects today is that
"Resource groups are represented as directories in the resctrl file system".
There is no flexibility here to add directories that mean something else.
Please keep existing user space in mind in proposals. [1] marches forward
ignoring previous comments from this discussion on how existing user space is
impacted without addressing those comments on why such changes are acceptable.
Reinette
[1] https://lore.kernel.org/lkml/aZXsihgl0B-o1DI6@agluck-desk3/
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-18 16:44 ` Luck, Tony
2026-02-19 17:03 ` Luck, Tony
2026-02-19 17:33 ` Ben Horgan
@ 2026-02-20 2:53 ` Reinette Chatre
2026-02-20 22:44 ` Moger, Babu
2026-02-23 10:08 ` Ben Horgan
2 siblings, 2 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-20 2:53 UTC (permalink / raw)
To: Luck, Tony, Ben Horgan, Moger, Babu, eranian@google.com
Cc: Moger, Babu, Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Tony, Ben, Babu, and Stephane,
On 2/18/26 8:44 AM, Luck, Tony wrote:
> On Tue, Feb 17, 2026 at 03:55:44PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 2/17/26 2:52 PM, Luck, Tony wrote:
>>> On Tue, Feb 17, 2026 at 02:37:49PM -0800, Reinette Chatre wrote:
>>>> Hi Tony,
>>>>
>>>> On 2/17/26 1:44 PM, Luck, Tony wrote:
>>>>>>>>> I'm not sure if this would happen in the real world or not.
>>>>>>>>
>>>>>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
>>>>>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>>>>>>>
>>>>>>> Indeed. This is all getting a bit complicated.
>>>>>>>
>>>>>>
>>>>>> ack
>>>>>
>>>>> We have several proposals so far:
>>>>>
>>>>> 1) Ben's suggestion to use the default group (either with a Babu-style
>>>>> "plza" file just in that group, or a configuration file under "info/").
>>>>>
>>>>> This is easily the simplest for implementation, but has no flexibility.
>>>>> Also requires users to move all the non-critical workloads out to other
>>>>> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
>>>>>
>>>>> 2) My thoughts are for a separate group that is only used to configure
>>>>> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
>>>>> are used for all tasks when in kernel mode.
>>>>>
>>>>> No context switch overhead. Has some flexibility.
>>>>>
>>>>> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
>>>>> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
>>>>> group in addition to belonging to another group that defines schemata
>>>>> resources when running in non-kernel mode.
>>>>> Tasks aren't required to be in the kernel group, in which case they
>>>>> keep the same CLOSID in both user and kernel mode. When used in this
>>>>> way there will be context switch overhead when changing between tasks
>>>>> with different kernel CLOSID/RMID.
>>>>>
>>>>> 4) Even more complex scenarios with more than one user configurable
>>>>> kernel group to give more options on resources available in the kernel.
>>>>>
>>>>>
>>>>> I had a quick pass at coding my option "2". My UI to designate the
>>>>> group to use for kernel mode is to reserve the name "kernel_group"
>>>>> when making CTRL_MON groups. Some tweaks to avoid creating the
>>>>> "tasks", "cpus", and "cpus_list" files (which might be done more
>>>>> elegantly), and "mon_groups" directory in this group.
>>>>
>>>> Should the decision of whether context switch overhead is acceptable
>>>> not be left up to the user?
>>>
>>> When someone comes up with a convincing use case to support one set of
>>> kernel resources when interrupting task A, and a different set of
>>> resources when interrupting task B, we should certainly listen.
>>
>> Absolutely. Someone can come up with such a use case at any time though. This
>> could be, and as has happened with some other resctrl interfaces, likely will be
>> after this feature has been supported for a few kernel versions. What timeline
>> should we give which users to share their use cases with us? Even if we do hear
>> from some users will that guarantee that no such use case will arise in the
>> future? Such predictions of usage are difficult for me and I thus find it simpler
>> to think of flexible ways to enable the features that we know the hardware supports.
>>
>> This does not mean that a full featured solution needs to be implemented from day 1.
>> If folks believe there are "no valid use cases" today resctrl still needs to prepare for
>> how it can grow to support full hardware capability and hardware designs in the
>> future.
>>
>> Also, please consider not just resources for kernel work but also monitoring for
>> kernel work. I do think, for example, a reasonable use case may be to determine
>> how much memory bandwidth the kernel uses on behalf of certain tasks.
>>
>>>> I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
>>>> the needed registers will only be updated if there is a new CLOSID/RMID needed
>>>> for kernel space.
>>>
>>> Babu's RFC does this.
>>
>> Right.
>>
>>>
>>>> Are you suggesting that just this checking itself is too
>>>> expensive to justify giving user space more flexibility by fully enabling what
>>>> the hardware supports? If resctrl does draw such a line to not enable what
>>>> hardware supports it should be well justified.
>>>
>>> The check is likely light weight (as long as the variables to be
>>> compared reside in the same cache lines as the existing CLOSID
>>> and RMID checks). So if there is a use case for different resources
>>> when in kernel mode, then taking this path will be fine.
>>
>> Why limit this to knowing about a use case? As I understand this feature can be
>> supported in a flexible way without introducing additional context switch overhead
>> if the user prefers to use just one allocation for all kernel work. By being
>> configurable and allowing resctrl to support more use cases in the future resctrl
>> does not paint itself into a corner. This allows resctrl to grow support so that
>> the user can use all capabilities of the hardware with the understanding that it will
>> increase context switch time.
>>
>> Reinette
>
> How about this idea for extensibility.
>
> Rename Babu's "plza" file to "plza_mode". Instead of just being an
> on/off switch, it may accept multiple possible requests.
>
> Humorous version:
>
> # echo "babu" > plza_mode
>
> This results in the behavior of Babu's RFC. The CLOSID and RMID assigned
> to the CTRL_MON group are used when in kernel mode, but only for tasks
> that have their task-id written to the "tasks" file, or for tasks in the
> default group running on CPUs assigned to this group via the "cpus" or
> "cpus_list" files.
>
> # echo "tony" > plza_mode
>
> All tasks run with the CLOSID/RMID for this group. The "tasks", "cpus" and
> "cpus_list" files and the "mon_groups" directory are removed.
>
> # echo "ben" > plza_mode
>
> Only usable in the top-level default CTRL_MON directory. CLOSID=0/RMID=0
> are used for all tasks in kernel mode.
>
> # echo "stephane" > plza_mode
>
> The RMID for this group is freed. All tasks run in kernel mode with the
> CLOSID for this group, but use the same RMID for both user and kernel.
> In addition to files removed in "tony" mode, the mon_data directory is
> removed.
>
> # echo "some-future-name" > plza_mode
>
> Somebody has a new use case. Resctrl can be extended by allowing some
> new mode.
>
>
> Likely real implementation:
>
> Sub-components of each of the ideas above are encoded as a bitmask that
> is written to plza_mode. There is a file in the info/ directory listing
> which bits are supported on the current system (e.g. the "keep the same
> RMID" mode may be impractical on ARM, so it would not be listed as an
> option.)
I like the idea of a global file that indicates what is supported on the
system. I find this to match Ben's proposal of a "kernel_mode" file in
info/ that looks to be a good foundation to build on. Ben also reiterated support
for this in
https://lore.kernel.org/lkml/feaa16a5-765c-4c24-9e0b-c1f4ef87a66f@arm.com/
As I mentioned in https://lore.kernel.org/lkml/5c19536b-aca0-42ce-a9d5-211fbbdbb485@intel.com/
the suggestions surrounding the per-resource group "plza_mode" file
are unexpected since they ignore earlier comments about impact on user space.
Specifically, this proposal does not address:
https://lore.kernel.org/lkml/aY3bvKeOcZ9yG686@e134344.arm.com/
https://lore.kernel.org/lkml/c779ce82-4d8a-4943-b7ec-643e5a345d6c@arm.com/
Below I aim to summarize the discussions as they relate to constraints and
requirements. I intended to capture everything that has been mentioned in these
discussions so far; if I did miss something it was not intentional, so
please point it out to help make this summary complete.
I hope by starting with this we can start with at least agreeing what
resctrl needs to support and how user space could interact with resctrl
to meet requirements.
After the summary of what resctrl needs to support I aim to combine
capabilities from the various proposals to meet the constraints and
requirements as I understand them so far. This aims to build on all that
has been shared until now.
Any comments are appreciated.
Summary of considerations surrounding CLOSID/RMID (PARTID/PMG) assignment for kernel work
=========================================================================================
- PLZA currently only supports global assignment (only PLZA_EN of
MSR_IA32_PQR_PLZA_ASSOC may differ on logical processors). Even so, current
speculation is that RMID_EN=0 implies that the user space RMID is used to monitor
kernel work, which could appear to the user as "kernel mode" supporting multiple RMIDs.
https://lore.kernel.org/lkml/abb049fa-3a3d-4601-9ae3-61eeb7fd8fcf@amd.com/
- MPAM can set unique PARTID and PMG on every logical processor.
https://lore.kernel.org/lkml/fd7e0779-7e29-461d-adb6-0568a81ec59e@arm.com/
- While current PLZA only supports global assignment it may in future generations
not require MSR_IA32_PQR_PLZA_ASSOC to be same on logical processors. resctrl
thus needs to be flexible here.
https://lore.kernel.org/lkml/fa45088b-1aea-468e-8253-3238e91f76c7@amd.com/
- No equivalent feature on RISC-V.
https://lore.kernel.org/lkml/aYvP98xGoKPrDBCE@gen8/
- Impact on context switch delay is a concern and unnecessary context switch delay should
be avoided.
https://lore.kernel.org/lkml/aZThTzdxVcBkLD7P@agluck-desk3/
https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
- There is no requirement that a CLOSID/PARTID should be dedicated to kernel work.
Specifically, same CLOSID/PARTID can be used for user space and kernel work.
Also directly requested to not make kernel work CLOSID/PARTID exclusive:
https://lore.kernel.org/lkml/c8268b2a-50d7-44b4-ac3f-5ce6624599b1@arm.com/
- Only use case presented so far is related to memory bandwidth allocation where
all kernel work is done unthrottled or equivalent to highest priority tasks while
monitoring remains associated to task self.
https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
PLZA can support this with its global allocation (assuming RMID_EN=0 associates the user
RMID with kernel work). To support this use case MPAM would need to be able to
change both PARTID and PMG:
https://lore.kernel.org/lkml/845587f3-4c27-46d9-83f8-6b38ccc54183@arm.com/
- Motivation of this work is to run kernel work with more/all/unthrottled
resources to avoid priority inversions. We need to be careful with such
generalization since not all resource allocations are alike yet a CLOSID/PARTID
assignment applies to all resources. For example, user may designate a cache
portion for high priority user space work and then needs to choose which cache
portions the kernel may allocate into.
https://lore.kernel.org/lkml/6293c484-ee54-46a2-b11c-e1e3c736e578@arm.com/
- If all kernel work is done using the same allocation/CLOSID/PARTID then user
needs to decide whether the kernel work's cache allocation overlaps the high
priority tasks or not. To avoid evicting high priority task work it may be
simplest for kernel allocation to not overlap high priority work but kernel work
done on behalf of high priority work would then risk eviction by low priority
work.
- When considering cache allocation it seems more flexible to have high priority
work keep its cache allocation when entering the kernel? This implies more than
one CLOSID/PARTID may need to be used for kernel work.
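The context switch concern summarized above (only pay for a register
write when the incoming task actually needs a different kernel
CLOSID/RMID) can be modelled with a small sketch. All names are
invented; this is not the kernel's actual code path.

```python
# Minimal model of the context-switch check discussed in this thread:
# only touch the (modelled) PQR-style register when the incoming task's
# kernel-mode CLOSID/RMID differs from what the CPU already has.
class CPU:
    def __init__(self):
        self.kernel_closid = 0
        self.kernel_rmid = 0
        self.msr_writes = 0  # count of (expensive) register updates

    def switch_to(self, task):
        closid = task.get("kernel_closid", 0)
        rmid = task.get("kernel_rmid", 0)
        if (closid, rmid) != (self.kernel_closid, self.kernel_rmid):
            self.kernel_closid, self.kernel_rmid = closid, rmid
            self.msr_writes += 1  # would be a wrmsr on x86

cpu = CPU()
for t in [{"kernel_closid": 1}, {"kernel_closid": 1}, {}]:
    cpu.switch_to(t)
print(cpu.msr_writes)  # two changes: 0->1, then 1->0
```

If all tasks share one kernel CLOSID/RMID this check never fires after
the first switch, which is why a flexible interface need not cost
anything for the single-allocation configuration.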
TBD
===
- What is impact of different controls (for example the upcoming MAX) when tasks are
spread across multiple control groups?
https://lore.kernel.org/lkml/aY3bvKeOcZ9yG686@e134344.arm.com/
How can MPAM support the "monitor kernel work with user space work" use case?
=============================================================================
This considers how MPAM could support the use case presented in:
https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
To support this use case in MPAM the control group that dictates the allocations
used in kernel work has to have monitor group(s) where this usage is tracked and user
space would need to sum the kernel and user space usage. The number of PMG may vary
and resctrl cannot assume that the kernel control group would have sufficient monitor
groups to map 1:1 with user space control and monitor groups. Mapping user space
control and monitor groups to kernel monitor groups thus seems best to be done by
user space.
Some examples:
Consider allocation and monitoring setup for user space work:
/sys/fs/resctrl <= User space default allocations
/sys/fs/resctrl/g1 <= User space allocations g1
/sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
/sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
/sys/fs/resctrl/g2 <= User space allocations g2
/sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
/sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
Having a single control group for kernel work and a system that supports
7 PMG per PARTID makes it possible to have a monitoring group for each user space
monitoring group (I will go more into how such assignments can be made later):
/sys/fs/resctrl/kernel <= Kernel space allocations
/sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring default group
/sys/fs/resctrl/kernel/mon_groups/kernel_g1 <= Kernel space monitoring group g1
/sys/fs/resctrl/kernel/mon_groups/kernel_g1m1 <= Kernel space monitoring group g1m1
/sys/fs/resctrl/kernel/mon_groups/kernel_g1m2 <= Kernel space monitoring group g1m2
/sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring group g2
/sys/fs/resctrl/kernel/mon_groups/kernel_g2m1 <= Kernel space monitoring group g2m1
/sys/fs/resctrl/kernel/mon_groups/kernel_g2m2 <= Kernel space monitoring group g2m2
With a configuration as above user space can sum the monitoring events of the user space
groups and associated kernel space groups to obtain counts of all work done on behalf of
associated tasks.
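With such a layout the summation can be scripted. A minimal sketch, assuming the
standard resctrl mon_data layout with mbm_total_bytes event files and the
hypothetical group names from the example above (mon_L3_* domain directory names
vary by system):

```shell
#!/bin/sh
# Sum MBM byte counts for user monitor group g1/g1m1 and its associated
# kernel monitor group kernel_g1m1 across all L3 domains. The group names
# are the hypothetical ones from the example above.
R=/sys/fs/resctrl
total=0
for f in "$R"/g1/mon_groups/g1m1/mon_data/mon_L3_*/mbm_total_bytes \
         "$R"/kernel/mon_groups/kernel_g1m1/mon_data/mon_L3_*/mbm_total_bytes
do
        # Unmatched globs are skipped by the readability test below.
        [ -r "$f" ] && total=$((total + $(cat "$f")))
done
echo "g1m1 user+kernel bytes: $total"
```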
It may not be possible to have such a 1:1 relationship, in which case user space would have to
arrange groups to match its usage. For example, if the system only supports two PMG per PARTID
then user space may find it best to track monitoring as below:
/sys/fs/resctrl/kernel <= Kernel space allocations
/sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
/sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
Requirements
============
Based on an understanding of what PLZA and MPAM are (and could be) capable of, and considering the
use cases presented thus far, it seems that resctrl has to:
- support global assignment of resource group for kernel work
- support per-resource group assignment for kernel work
How can resctrl support the requirements?
=========================================
New global resctrl fs files
===========================
info/kernel_mode (always visible)
info/kernel_mode_assignment (visibility and content depends on active setting in info/kernel_mode)
info/kernel_mode
================
- Displays the currently active mode as well as the possible modes available to user
space.
- Single place where the user can query "kernel mode" behavior and capabilities of the
system.
- Some possible values:
- inherit_ctrl_and_mon <=== previously named "match_user", just renamed for consistency with other names
When active, kernel and user space use the same CLOSID/RMID. The current status
quo for x86.
- global_assign_ctrl_inherit_mon
When active, CLOSID/control group can be assigned for *all* (hence, "global")
kernel work while all kernel work uses same RMID as user space.
Can only be supported on architectures where CLOSID and RMID are independent.
An arch may support this in hardware (RMID_EN=0?) or this can be done by resctrl during
context switch if the RMID is independent and the context switch cost is
considered "reasonable".
This supports use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
for PLZA.
- global_assign_ctrl_assign_mon
When active the same resource group (CLOSID and RMID) can be assigned to
*all* kernel work. This could be any group, including the default group.
There may not be a use case for this but it could be useful as an intermediate
step toward the mode that follows (more later).
- per_group_assign_ctrl_assign_mon
When active every resource group can be associated with another (or the same)
resource group. This association maps the resource group for user space work
to resource group for kernel work. This is similar to the "kernel_group" idea
presented in:
https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
This addresses use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
for MPAM.
- Additional values can be added as new requirements arise, for example "per_task"
assignment. Connecting the visibility of info/kernel_mode_assignment to the mode in
info/kernel_mode enables resctrl to later support additional modes that may require
different configuration files, potentially per resource group like the "tasks_kernel"
(or perhaps rather "kernel_mode_tasks" to have a consistent prefix for this feature)
and "cpus_kernel" ("kernel_mode_cpus"?) discussed in these threads.
User can view active and supported modes:
# cat info/kernel_mode
[inherit_ctrl_and_mon]
global_assign_ctrl_inherit_mon
global_assign_ctrl_assign_mon
User can switch modes:
# echo global_assign_ctrl_inherit_mon > info/kernel_mode
# cat info/kernel_mode
inherit_ctrl_and_mon
[global_assign_ctrl_inherit_mon]
global_assign_ctrl_assign_mon
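A mode switch can be guarded by first checking the advertised modes. A sketch
against the proposed files (info/kernel_mode does not exist today, and the mode
names are the ones suggested above):

```shell
#!/bin/sh
# Switch kernel_mode only if the target mode is listed as available.
R=/sys/fs/resctrl
mode=global_assign_ctrl_inherit_mon
if grep -qw "$mode" "$R/info/kernel_mode" 2>/dev/null; then
        echo "$mode" > "$R/info/kernel_mode"
else
        echo "mode $mode not supported on this system" >&2
fi
```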
info/kernel_mode_assignment
===========================
- Visibility depends on the active mode in info/kernel_mode.
- Content depends on the active mode in info/kernel_mode.
- Syntax to identify resource groups can use the syntax created as part of the earlier ABMC work
that supports the default group https://lore.kernel.org/lkml/cover.1737577229.git.babu.moger@amd.com/
- Default CTRL_MON group and if relevant, the default MON group, can be the default
assignment when user just changes the kernel_mode without setting the assignment.
info/kernel_mode_assignment when mode is global_assign_ctrl_inherit_mon
-----------------------------------------------------------------------
- info/kernel_mode_assignment contains single value that is the name of the control group
used for all kernel work.
- CLOSID/PARTID used for kernel work is determined from the control group assigned
- default value is default CTRL_MON group
- no monitor group assignment, kernel work inherits user space RMID
- syntax is
<CTRL_MON group> with "/" meaning default.
info/kernel_mode_assignment when mode is global_assign_ctrl_assign_mon
-----------------------------------------------------------------------
- info/kernel_mode_assignment contains single value that is the name of the resource group
used for all kernel work.
- Combined CLOSID/RMID or combined PARTID/PMG is set globally to be associated with all
kernel work.
- default value is default CTRL_MON group
- syntax is
<CTRL_MON group>/<MON group>/ with "//" meaning default control and default monitoring group.
info/kernel_mode_assignment when mode is per_group_assign_ctrl_assign_mon
-------------------------------------------------------------------------
- this presents the information proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
within a single file for convenience and potential optimization when user space needs to make changes.
Interface proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/ is also an option
and as an alternative a per-resource group "kernel_group" can be made visible when user space enables
this mode.
- info/kernel_mode_assignment contains a mapping of every resource group to another resource group:
<resource group for user space work>:<resource group for kernel work>
- all resource groups must be present in first field of this file
- Even though this is a "per group" setting the expectation is that this will set the
kernel work CLOSID/RMID for every task. This implies that writing to this file would need
to take tasklist_lock which, when held for too long, may impact other parts of the system.
See https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
Scenarios supported
===================
Default
-------
For x86 I understand kernel work and user work to be done with the same CLOSID/RMID, which
implies that info/kernel_mode can always be visible and at least display:
# cat info/kernel_mode
[inherit_ctrl_and_mon]
info/kernel_mode_assignment is not visible in this mode.
I understand MPAM may have different defaults here so I would like to understand this better.
Dedicated global allocations for kernel work, monitoring same for user space and kernel (PLZA)
----------------------------------------------------------------------------------------------
Possible scenario with PLZA, not MPAM (see later):
1. Create group(s) to manage allocations associated with user space work
and assign tasks/CPUs to these groups.
2. Create group to manage allocations associated with all kernel work.
- For example,
# mkdir /sys/fs/resctrl/unthrottled
- No constraints from the resctrl fs on interactions with files in this group. From the resctrl
fs perspective it is not "dedicated" to kernel work but just another resource group.
User space can still assign tasks/CPUs to this group, which will result in this group
being used for both kernel and user space control and monitoring. If user space wants
to dedicate a group to kernel work then they should not assign tasks/CPUs to it.
3. Set kernel mode to global_assign_ctrl_inherit_mon:
# echo global_assign_ctrl_inherit_mon > info/kernel_mode
- info/kernel_mode_assignment becomes visible and contains "/" to indicate that the default
resource group is used for all kernel work.
- Sets the "global" CLOSID to be used for kernel work to 0; the global RMID is not set.
4. Set control group to be used for all kernel work:
# echo unthrottled > info/kernel_mode_assignment
- Sets the "global" CLOSID to be used for kernel work to CLOSID associated with
CTRL_MON group named "unthrottled", no change to global RMID.
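Steps 1 through 4 above can be strung together. A sketch using the hypothetical
"unthrottled" group from the example; the kernel_mode files are the interface
proposed in this RFC, not an existing ABI:

```shell
#!/bin/sh
# PLZA scenario sketch: dedicated allocations for all kernel work while
# kernel monitoring stays shared with user space.
R=/sys/fs/resctrl
if [ -d "$R/info" ]; then
        mkdir -p "$R/unthrottled"                                   # step 2
        echo global_assign_ctrl_inherit_mon > "$R/info/kernel_mode" # step 3
        echo unthrottled > "$R/info/kernel_mode_assignment"         # step 4
fi
```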
Dedicated global allocations and monitoring for kernel work
-----------------------------------------------------------
- Steps 1 and 2 could be the same as above.
OR
2b. If there is an "unthrottled" control group that is used for both user space and kernel
allocations, a separate MON group can be used to track monitoring data for kernel work.
- For example,
# mkdir /sys/fs/resctrl/unthrottled <=== All high priority work, kernel and user space
# mkdir /sys/fs/resctrl/unthrottled/mon_groups/kernel_unthrottled <= Just monitor kernel work
3. Set kernel mode to global_assign_ctrl_assign_mon:
# echo global_assign_ctrl_assign_mon > info/kernel_mode
- info/kernel_mode_assignment becomes visible and contains "//" - default CTRL_MON is
used for all kernel work allocations and monitoring
- Sets both the "global" CLOSID and RMID to be used for kernel work to 0.
4. Set control group to be used for all kernel work:
# echo unthrottled/kernel_unthrottled > info/kernel_mode_assignment
- Sets the "global" CLOSID to be used for kernel work to the CLOSID associated with the
CTRL_MON group named "unthrottled" and the RMID used for kernel work to the RMID
associated with the child MON group named "kernel_unthrottled" within the "unthrottled" group.
Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
----------------------------------------------------------------------------------------------
1. User space creates resource and monitoring groups for user tasks:
/sys/fs/resctrl <= User space default allocations
/sys/fs/resctrl/g1 <= User space allocations g1
/sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
/sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
/sys/fs/resctrl/g2 <= User space allocations g2
/sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
/sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
2. User space creates resource and monitoring groups for kernel work (system has two PMG):
/sys/fs/resctrl/kernel <= Kernel space allocations
/sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
/sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
3. Set kernel mode to per_group_assign_ctrl_assign_mon:
# echo per_group_assign_ctrl_assign_mon > info/kernel_mode
- info/kernel_mode_assignment becomes visible and contains
# cat info/kernel_mode_assignment
//://
g1//://
g1/g1m1/://
g1/g1m2/://
g2//://
g2/g2m1/://
g2/g2m2/://
- An optimization here may be to implement the change to per_group_assign_ctrl_assign_mon mode
similarly to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
avoid holding tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to a default just for
user space to likely change it.
4. Set groups to be used for kernel work:
# printf '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
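The multi-line mapping is easier to read when built up first and then written in
one go. A sketch using the example's hypothetical group names and the proposed
info/kernel_mode_assignment file:

```shell
#!/bin/sh
# Build the user->kernel group mapping from the example above and write it
# with a single write. printf is used because a plain echo does not expand
# '\n' escapes portably.
map='//:kernel//
g1//:kernel//
g1/g1m1/:kernel//
g1/g1m2/:kernel//
g2//:kernel/kernel_g2/
g2/g2m1/:kernel/kernel_g2/
g2/g2m2/:kernel/kernel_g2/'
f=/sys/fs/resctrl/info/kernel_mode_assignment
if [ -w "$f" ]; then
        printf '%s\n' "$map" > "$f"
fi
```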
The interfaces proposed aim to maintain compatibility with existing user space tools while
adding support for all requirements expressed thus far in an efficient way. For an existing
user space tool there is no change in the meaning of any existing file and no existing known
resource group files are made to disappear. There is a global configuration that lets user space
manage allocations without needing to check and configure each control group, and even per-resource
group allocations can be managed from user space with a single read/write, supporting
changes in the most efficient way.
What do you think?
Reinette
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-19 17:03 ` Luck, Tony
2026-02-19 17:45 ` Ben Horgan
@ 2026-02-20 8:21 ` Drew Fustini
1 sibling, 0 replies; 114+ messages in thread
From: Drew Fustini @ 2026-02-20 8:21 UTC (permalink / raw)
To: Luck, Tony
Cc: Reinette Chatre, Ben Horgan, Moger, Babu, Moger, Babu,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
On Thu, Feb 19, 2026 at 09:03:20AM -0800, Luck, Tony wrote:
> > Likely real implementation:
> >
> > Sub-components of each of the ideas above are encoded as a bitmask that
> > is written to plza_mode. There is a file in the info/ directory listing
> > which bits are supported on the current system (e.g. the "keep the same
> > RMID" mode may be impractical on ARM, so it would not be listed as an
> > option.)
>
>
> In x86 terms where control and monitor functions are independent we
> have:
>
> Control:
> 1) Use default (CLOSID==0) for kernel
> 2) Allocate just one CLOSID for kernel
> 3) Allocate many CLOSIDs for kernel
>
> Monitor:
> 1) Do not monitor kernel separately from user
> 2) Use default (RMID==0) for kernel
> 3) Allocate one RMID for kernel
> 4) Allocate many RMIDs for kernel
>
> What options are possible on ARM & RISC-V?
The RISC-V Ssqosid extension just adds one register to each processor
which contains a single resource control id (rcid) and a single
monitoring control id (mcid). Any switching of rcid or mcid between
kernel mode and user mode would need to be done manually by the kernel
on entry/exit.
Thanks,
Drew
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-12 3:51 ` Reinette Chatre
2026-02-12 19:09 ` Babu Moger
@ 2026-02-20 10:07 ` Ben Horgan
2026-02-20 18:39 ` Reinette Chatre
2026-02-21 0:12 ` Moger, Babu
2026-02-23 13:21 ` Fenghua Yu
2026-02-23 13:21 ` Fenghua Yu
3 siblings, 2 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-20 10:07 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, Moger, Babu, corbet, tony.luck,
Dave.Martin, james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette, Babu,
On 2/12/26 03:51, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/11/26 1:18 PM, Babu Moger wrote:
>> On 2/11/26 10:54, Reinette Chatre wrote:
>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>>>> On AMD systems, the existing MBA feature allows the user to set a bandwidth
>>>>>> limit for each QOS domain. However, multiple QOS domains share system
>>>>>> memory bandwidth as a resource. In order to ensure that system memory
>>>>>> bandwidth is not over-utilized, user must statically partition the
>>>>>> available system bandwidth between the active QOS domains. This typically
>>>>> How do you define "active" QoS Domain?
>>>> Some domains may not have any CPUs associated with that CLOSID. By "active" I meant domains that have CPUs assigned to the CLOSID.
>>> To confirm, is this then specific to assigning CPUs to resource groups via
>>> the cpus/cpus_list files? This refers to how a user needs to partition
>>> available bandwidth so I am still trying to understand the message here since
>>> users still need to do this even when CPUs are not assigned to resource
>>> groups.
>>>
>> It is not specific to CPU assignment. It applies to task assignment also.
>>
>> For example: We have 4 domains;
>>
>> # cat schemata
>> MB:0=8192;1=8192;2=8192;3=8192
>>
>> If this group has CPUs assigned to only the first two domains, then the group has only two active domains and we will only update the first two domains. The MB values in the other domains do not matter.
>
> I see, thank you. As I understand an "active QoS domain" is something only user
> space can designate. It may be possible for resctrl to get a sense of which QoS domains
> are "active" when only CPUs are assigned to a resource group but when it comes to task
> assignment it is user space that controls where tasks belonging to a group can be
> scheduled and thus which QoS domains are "active" or not.
>
>>
>> #echo "MB:0=8;1=8" > schemata
>>
>> # cat schemata
>> MB:0=8;1=8;2=8192;3=8192
>>
>> The combined bandwidth can go up to 16(8+8) units. Each unit is 1/8 GB.
>>
>> With GMBA, we can set the combined limit at a higher level and the total bandwidth will not exceed the GMBA limit.
>
> Thank you for the confirmation.
>
>>
>>>>>> results in system memory being under-utilized since not all QOS domains are
>>>>>> using their full bandwidth Allocation.
>>>>>>
>>>>>> AMD PQoS Global Bandwidth Enforcement(GLBE) provides a mechanism
>>>>>> for software to specify bandwidth limits for groups of threads that span
>>>>>> multiple QoS Domains. This collection of QOS domains is referred to as GLBE
>>>>>> control domain. The GLBE ceiling sets a maximum limit on a memory bandwidth
>>>>>> in GLBE control domain. Bandwidth is shared by all threads in a Class of
>>>>>> Service(COS) across every QoS domain managed by the GLBE control domain.
>>>>> How does this bandwidth allocation limit impact existing MBA? For example, if a
>>>>> system has two domains (A and B) that user space separately sets MBA
>>>>> allocations for while also placing both domains within a "GLBE control domain"
>>>>> with a different allocation, does the individual MBA allocations still matter?
>>>> Yes. Both ceilings are enforced at their respective levels.
>>>> The MBA ceiling is applied at the QoS domain level.
>>>> The GLBE ceiling is applied at the GLBE control domain level.
>>>> If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
>>> It sounds as though MBA and GMBA/GLBE operates within the same parameters wrt
>>> the limits but in examples in this series they have different limits. For example,
>>> in the documentation patch [1] there is this:
>>>
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=4096;1=4096;2=4096;3=4096
>>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>>
>>> followed up with what it will look like in new generation [2]:
>>>
>>> GMB:0=4096;1=4096;2=4096;3=4096
>>> MB:0=8192;1=8192;2=8192;3=8192
>>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>>
>>> In both examples the per-domain MB ceiling is higher than the global GMB ceiling. With
>>> above showing defaults and you state "If the MBA ceiling exceeds the GLBE ceiling,
>>> the effective MBA limit will be capped by the GLBE ceiling." - does this mean that
>>> MB ceiling can never be higher than GMB ceiling as shown in the examples?
>>
>> That is correct. There is one more information here. The MB unit is in 1/8 GB and GMB unit is 1GB. I have added that in documentation in patch 4.
>
> ah - right. I did not take the different units into account.
>
>>
>> The GMB limit defaults to the max value 4096 (bit 12 set) when a new group is created, meaning the GMB limit does not apply by default.
>>
>> When setting the limits, the same value should be set in all the domains in the GMB control domain. Having different values in each domain results in unexpected behavior.
>>
>>>
>>> Another question, when setting aside possible differences between MB and GMB.
>>>
>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>
>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>> same:
>>>
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=2048;1=2048;2=2048;3=2048
>>>
>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>> MB limit:
>>> # echo "GMB:0=8;2=8" > schemata
>>> # cat schemata
>>> GMB:0=8;1=2048;2=8;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>>
>> Yes. That is correct. It will cap the MB setting to 8. Note that we are talking about unit differences to make it simple.
>
> Thank you for confirming.
>
>>
>>
>>> ... and then when user space resets GMB the MB can reset like ...
>>>
>>> # echo "GMB:0=2048;2=2048" > schemata
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=2048;1=2048;2=2048;3=2048
>>>
>>> if I understand correctly this will only apply if the MB limit was never set so
>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>>>
>>> # echo "GMB:0=8;2=8" > schemata
>>> # cat schemata
>>> GMB:0=8;1=2048;2=8;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>>>
>>> # echo "GMB:0=2048;2=2048" > schemata
>>> # cat schemata
>>> GMB:0=2048;1=2048;2=2048;3=2048
>>> MB:0=8;1=2048;2=8;3=2048
>>>
>>> What would be most intuitive way for user to interact with the interfaces?
>>
>> I see that you are trying to display the effective behaviors above.
>
> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
> what would be a reasonable expectation from resctrl be during these interactions.
>
>>
>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>
> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
> settings may cause confusion?
>
>>
>> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
>
> Yes, this will require resctrl to maintain more state.
>
> Documenting behavior is an option but I think we should first consider if there are things
> resctrl can do to make the interface intuitive to use.
>
>>>>>> From the description it sounds as though there is a new "memory bandwidth
>>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>>> GMBA allocations while the proposed user interface present them as independent.
>>>>>
>>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>>> I hope this clarifies your question.
>>> No. When enumerating the features the number of CLOSID supported by each is
>>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
>> No. There is no such scenario.
>>>
>>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>>> scenarios where some resource groups can support global AND per-domain limits while other
>>> resource groups can just support global or just support per-domain limits. Is this correct?
>>
>> System can support up to 16 CLOSIDs. All of them support all the features LLC, MB, GMB, SMBA. Yes. We have separate enumeration for each feature. Are you suggesting to change it ?
>
> It is not a concern to have different CLOSIDs between resources that are actually different,
> for example, having LLC or MB support different number of CLOSIDs. Having the possibility to
> allocate the *same* resource (memory bandwidth) with varying number of CLOSIDs does present a
> challenge though. Would it be possible to have a snippet in the spec that explicitly states
> that MB and GMB will always enumerate with the same number of CLOSIDs?
>
> Please see below where I will try to support this request more clearly and you can decide if
> it is reasonable.
>
>>>>> can be seen as a single "resource" that can be allocated differently based on
>>>>> the various schemata associated with that resource. This currently has a
>>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>>> may be something that we can reconsider?
>>>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on new approach.
>>> The new approach is not final so please provide feedback to help improve it so
>>> that the features you are enabling can be supported well.
>>
>> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits certain architectures.
>
> It benefits all architectures.
>
> There are two parts to the current proposals.
>
> Part 1: Generic schema description
> I believe there is consensus on this approach. This is actually something that is long
> overdue and something like this would have been great to have with the initial AMD
> enabling. With the generic schema description forming part of resctrl the user can learn
> from resctrl how to interact with the schemata file instead of relying on external information
> and documentation.
>
> For example, on an Intel system that uses percentage based proportional allocation for memory
> bandwidth the new resctrl files will display:
> info/MB/resource_schemata/MB/type:scalar linear
> info/MB/resource_schemata/MB/unit:all
> info/MB/resource_schemata/MB/scale:1
> info/MB/resource_schemata/MB/resolution:100
> info/MB/resource_schemata/MB/tolerance:0
> info/MB/resource_schemata/MB/max:100
> info/MB/resource_schemata/MB/min:10
>
>
> On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
> info/MB/resource_schemata/MB/type:scalar linear
> info/MB/resource_schemata/MB/unit:GBps
> info/MB/resource_schemata/MB/scale:1
> info/MB/resource_schemata/MB/resolution:8
> info/MB/resource_schemata/MB/tolerance:0
> info/MB/resource_schemata/MB/max:2048
> info/MB/resource_schemata/MB/min:1
>
> Having such interface will be helpful today. Users do not need to first figure out
> whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
> before interacting with resctrl. resctrl will be the generic interface it intends to be.
>
> Part 2: Supporting multiple controls for a single resource
> This is a new feature on which there also appears to be consensus that is needed by MPAM and
> Intel RDT where it is possible to use different controls for the same resource. For example,
> there can be a minimum and maximum control associated with the memory bandwidth resource.
>
> For example,
> info/
> └─ MB/
> └─ resource_schemata/
> ├─ MB/
> ├─ MB_MIN/
> ├─ MB_MAX/
> ┆
>
>
> Here is where the big question comes in for GLBE - is this actually a new resource
> for which resctrl needs to add interfaces to manage its allocation, or is it instead
> an additional control associated with the existing memory bandwidth resource?
>
> For me things are actually pointing to GLBE not being a new resource but instead being
> a new control for the existing memory bandwidth resource.
>
> I understand that for a PoC it is simplest to add support for GLBE as a new resource as is
> done in this series, but considering it an actual unique resource does not seem
> appropriate since resctrl already has a "memory bandwidth" resource. User space expects
> to find all the resources that it can allocate in info/ - I do not think it is correct
> to have two separate directories/resources for memory bandwidth here.
>
> What if, instead, it looks something like:
>
> info/
> └── MB/
> └── resource_schemata/
> ├── GMB/
> │ ├── max:4096
> │ ├── min:1
> │ ├── resolution:1
> │ ├── scale:1
> │ ├── tolerance:0
> │ ├── type:scalar linear
> │ └── unit:GBps
> └── MB/
> ├── max:8192
> ├── min:1
> ├── resolution:8
> ├── scale:1
> ├── tolerance:0
> ├── type:scalar linear
> └── unit:GBps
>
> With an interface like above GMB is just another control/schema used to allocate the
> existing memory bandwidth resource. With the planned files it is possible to express the
> different maximums and units used by the MB and GMB schema. Users no longer need to
> dig for the unit information in the docs, it is available in the interface.
>
> Doing something like this does depend on GLBE supporting the same number of CLOSIDs
> as MB, which seems to be how this will be implemented. If there is indeed a confirmation
> of this from AMD architecture then we can do something like this in resctrl.
I haven't fully understood what GLBE is but in MPAM we have an optional
feature in MSC (MPAM devices) called partid narrowing. For some MSC
there are limited controls and the incoming partid is mapped to an
effective partid using a mapping. This mapping is software controllable.
Dave (with Shaopeng and Zeng) has a proposal to use this to use partid
bits as pmg bits, [1]. This usage would have to be opt-in as it changes
the number of closid/rmid that MPAM presents to resctrl. If, however, the
user doesn't use that scheme then the controls could be presented as
controls for groups of closid in resctrl. Is this similar/usable with
the same interface as GLBE or have I misunderstood?
[1]
https://lore.kernel.org/linux-arm-kernel/20241212154000.330467-1-Dave.Martin@arm.com/
>
> There is a "part 3" to the proposals that attempts to address the new requirement where
> some of the controls allocate at a different scope while also requiring monitoring at
> that new scope. After learning more about GLBE this does not seem relevant to GLBE but is
> something to return to for the "MPAM CPU-less" work. We could already prepare for this
> by adding the new "scope" schema property though.
>
>
> Reinette
>
>>
>> Thanks
>>
>> Babu
>>
>>
>>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/d58f70592a4ce89e744e7378e49d5a36be3fd05e.1769029977.git.babu.moger@amd.com/
>>> [2] https://lore.kernel.org/lkml/e0c79c53-489d-47bf-89b9-f1bb709316c6@amd.com/
>>>
>
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-20 10:07 ` Ben Horgan
@ 2026-02-20 18:39 ` Reinette Chatre
2026-02-23 9:29 ` Ben Horgan
2026-02-21 0:12 ` Moger, Babu
1 sibling, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-20 18:39 UTC (permalink / raw)
To: Ben Horgan, Babu Moger, Moger, Babu, corbet, tony.luck,
Dave.Martin, james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Ben,
On 2/20/26 2:07 AM, Ben Horgan wrote:
>
> I haven't fully understood what GLBE is but in MPAM we have an optional
> feature in MSC (MPAM devices) called partid narrowing. For some MSC
> there are limited controls and the incoming partid is mapped to an
> effective partid using a mapping. This mapping is software controllable.
> Dave (with Shaopeng and Zeng) has a proposal that uses this to repurpose partid
> bits as pmg bits [1]. This usage would have to be opt-in as it changes
> the number of closid/rmid that MPAM presents to resctrl. If however, the
> user doesn't use that scheme then the controls could be presented as
> controls for groups of closid in resctrl. Is this similar/usable with
> the same interface as GLBE or have I misunderstood?
>
> [1]
> https://lore.kernel.org/linux-arm-kernel/20241212154000.330467-1-Dave.Martin@arm.com/
On a high level these look like different capabilities to me, but I look forward to
hearing from others to understand where I may be wrong.
As I understand it, the feature you refer to is a way in which MPAM can increase the
number of hardware monitoring IDs available(*). It does so by using the PARTID
narrowing feature while taking advantage of the fact that the PARTID used for filtering
resource monitors is always a "request PARTID". On its own, I understand the PARTID
narrowing feature to control how resource allocation of a *single* "MPAM component"
is managed.
On the other hand I see GLBE as a feature that essentially allows the scope of
allocation to span multiple domains/components.
As I see it, applying GLBE to MPAM would need the capability to, for example,
set a memory bandwidth MAX that is shared across multiple MPAM components.
Reinette
* As a sidenote, it is not clear to me why this would require an opt-in since
there seem to be only benefits to this.
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-20 2:53 ` Reinette Chatre
@ 2026-02-20 22:44 ` Moger, Babu
2026-02-23 17:12 ` Reinette Chatre
2026-02-23 10:08 ` Ben Horgan
1 sibling, 1 reply; 114+ messages in thread
From: Moger, Babu @ 2026-02-20 22:44 UTC (permalink / raw)
To: Reinette Chatre, Luck, Tony, Ben Horgan, Moger, Babu,
eranian@google.com
Cc: Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Reinette,
Thanks for the detailed summary and proposal.
On 2/19/2026 8:53 PM, Reinette Chatre wrote:
> Hi Tony, Ben, Babu, and Stephane,
>
> On 2/18/26 8:44 AM, Luck, Tony wrote:
>> On Tue, Feb 17, 2026 at 03:55:44PM -0800, Reinette Chatre wrote:
>>> Hi Tony,
>>>
>>> On 2/17/26 2:52 PM, Luck, Tony wrote:
>>>> On Tue, Feb 17, 2026 at 02:37:49PM -0800, Reinette Chatre wrote:
>>>>> Hi Tony,
>>>>>
>>>>> On 2/17/26 1:44 PM, Luck, Tony wrote:
>>>>>>>>>> I'm not sure if this would happen in the real world or not.
>>>>>>>>>
>>>>>>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
>>>>>>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>>>>>>>>
>>>>>>>> Indeed. This is all getting a bit complicated.
>>>>>>>>
>>>>>>>
>>>>>>> ack
>>>>>>
>>>>>> We have several proposals so far:
>>>>>>
>>>>>> 1) Ben's suggestion to use the default group (either with a Babu-style
>>>>>> "plza" file just in that group, or a configuration file under "info/").
>>>>>>
>>>>>> This is easily the simplest for implementation, but has no flexibility.
>>>>>> Also requires users to move all the non-critical workloads out to other
>>>>>> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
>>>>>>
>>>>>> 2) My thoughts are for a separate group that is only used to configure
>>>>>> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
>>>>>> are used for all tasks when in kernel mode.
>>>>>>
>>>>>> No context switch overhead. Has some flexibility.
>>>>>>
>>>>>> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
>>>>>> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
>>>>>> group in addition to belonging to another group that defines schemata
>>>>>> resources when running in non-kernel mode.
>>>>>> Tasks aren't required to be in the kernel group, in which case they
>>>>>> keep the same CLOSID in both user and kernel mode. When used in this
>>>>>> way there will be context switch overhead when changing between tasks
>>>>>> with different kernel CLOSID/RMID.
>>>>>>
>>>>>> 4) Even more complex scenarios with more than one user configurable
>>>>>> kernel group to give more options on resources available in the kernel.
>>>>>>
>>>>>>
>>>>>> I had a quick pass at coding my option "2". My UI to designate the
>>>>>> group to use for kernel mode is to reserve the name "kernel_group"
>>>>>> when making CTRL_MON groups. Some tweaks to avoid creating the
>>>>>> "tasks", "cpus", and "cpus_list" files (which might be done more
>>>>>> elegantly), and "mon_groups" directory in this group.
>>>>>
>>>>> Should the decision of whether context switch overhead is acceptable
>>>>> not be left up to the user?
>>>>
>>>> When someone comes up with a convincing use case to support one set of
>>>> kernel resources when interrupting task A, and a different set of
>>>> resources when interrupting task B, we should certainly listen.
>>>
>>> Absolutely. Someone can come up with such a use case at any time though. This
>>> could be, and as has happened with some other resctrl interfaces, likely will be
>>> after this feature has been supported for a few kernel versions. What timeline
>>> should we give which users to share their use cases with us? Even if we do hear
>>> from some users will that guarantee that no such use case will arise in the
>>> future? Such predictions of usage are difficult for me and I thus find it simpler
>>> to think of flexible ways to enable the features that we know the hardware supports.
>>>
>>> This does not mean that a full featured solution needs to be implemented from day 1.
>>> If folks believe there are "no valid use cases" today resctrl still needs to prepare for
>>> how it can grow to support full hardware capability and hardware designs in the
>>> future.
>>>
>>> Also, please also consider not just resources for kernel work but also monitoring for
>>> kernel work. I do think, for example, a reasonable use case may be to determine
>>> how much memory bandwidth the kernel uses on behalf of certain tasks.
>>>
>>>>> I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
>>>>> the needed registers will only be updated if there is a new CLOSID/RMID needed
>>>>> for kernel space.
>>>>
>>>> Babu's RFC does this.
>>>
>>> Right.
>>>
>>>>
>>>>> Are you suggesting that just this checking itself is too
>>>>> expensive to justify giving user space more flexibility by fully enabling what
>>>>> the hardware supports? If resctrl does draw such a line to not enable what
>>>>> hardware supports it should be well justified.
>>>>
>>>> The check is likely lightweight (as long as the variables to be
>>>> compared reside in the same cache lines as the existing CLOSID
>>>> and RMID checks). So if there is a use case for different resources
>>>> when in kernel mode, then taking this path will be fine.
>>>
>>> Why limit this to knowing about a use case? As I understand this feature can be
>>> supported in a flexible way without introducing additional context switch overhead
>>> if the user prefers to use just one allocation for all kernel work. By being
>>> configurable and allowing resctrl to support more use cases in the future resctrl
>>> does not paint itself into a corner. This allows resctrl to grow support so that
>>> the user can use all capabilities of the hardware with understanding that it will
>>> increase context switch time.
>>>
>>> Reinette
>>
>> How about this idea for extensibility.
>>
>> Rename Babu's "plza" file to "plza_mode". Instead of just being an
>> on/off switch, it may accept multiple possible requests.
>>
>> Humorous version:
>>
>> # echo "babu" > plza_mode
>>
>> This results in behavior of Babu's RFC. The CLOSID and RMID assigned to
>> the CTRL_MON group are used when in kernel mode, but only for tasks that
>> have their task-id written to the "tasks" file, or for tasks in the
>> default group running on CPUs assigned to this group via the "cpus" or
>> "cpus_list" files.
>>
>> # echo "tony" > plza_mode
>>
>> All tasks run with the CLOSID/RMID for this group. The "tasks", "cpus" and
>> "cpus_list" files and the "mon_groups" directory are removed.
>>
>> # echo "ben" > plza_mode"
>>
>> Only usable in the top-level default CTRL_MON directory. CLOSID=0/RMID=0
>> are used for all tasks in kernel mode.
>>
>> # echo "stephane" > plza_mode
>>
>> The RMID for this group is freed. All tasks run in kernel mode with the
>> CLOSID for this group, but use same RMID for both user and kernel.
>> In addition to files removed in "tony" mode, the mon_data directory is
>> removed.
>>
>> # echo "some-future-name" > plza_mode
>>
>> Somebody has a new use case. Resctrl can be extended by allowing some
>> new mode.
>>
>>
>> Likely real implementation:
>>
>> Sub-components of each of the ideas above are encoded as a bitmask that
>> is written to plza_mode. There is a file in the info/ directory listing
>> which bits are supported on the current system (e.g. the "keep the same
>> RMID" mode may be impractical on ARM, so it would not be listed as an
>> option.)
>
> I like the idea of a global file that indicates what is supported on the
> system. I find this to match Ben's proposal of a "kernel_mode" file in
> info/ that looks to be a good foundation to build on. Ben also reiterated support
> for this in
> https://lore.kernel.org/lkml/feaa16a5-765c-4c24-9e0b-c1f4ef87a66f@arm.com/
>
> As I mentioned in https://lore.kernel.org/lkml/5c19536b-aca0-42ce-a9d5-211fbbdbb485@intel.com/
> the suggestions surrounding the per-resource group "plza_mode" file
> are unexpected since they ignore earlier comments about impact on user space.
> Specifically, this proposal does not address:
> https://lore.kernel.org/lkml/aY3bvKeOcZ9yG686@e134344.arm.com/
> https://lore.kernel.org/lkml/c779ce82-4d8a-4943-b7ec-643e5a345d6c@arm.com/
>
> Below I aim to summarize the discussions as they relate to constraints and
> requirements. I intended to capture all that has been mentioned in these
> discussions so far; if I did miss something it was not intentional, so
> please point it out to help make this summary complete.
>
> I hope that by starting with this we can at least agree on what
> resctrl needs to support and how user space could interact with resctrl
> to meet requirements.
>
> After the summary of what resctrl needs to support I aim to combine
> capabilities from the various proposals to meet the constraints and
> requirements as I understand them so far. This aims to build on all that
> has been shared until now.
>
> Any comments are appreciated.
>
> Summary of considerations surrounding CLOSID/RMID (PARTID/PMG) assignment for kernel work
> =========================================================================================
>
> - PLZA currently only supports global assignment (only PLZA_EN of
> MSR_IA32_PQR_PLZA_ASSOC may differ on logical processors). Even so, current
> speculation is that RMID_EN=0 implies that user space RMID is used to monitor
> kernel work that could appear to user as "kernel mode" supporting multiple RMIDs.
> https://lore.kernel.org/lkml/abb049fa-3a3d-4601-9ae3-61eeb7fd8fcf@amd.com/
Yes. RMID_EN=0 means don't use a separate RMID for PLZA.
>
> - MPAM can set unique PARTID and PMG on every logical processor.
> https://lore.kernel.org/lkml/fd7e0779-7e29-461d-adb6-0568a81ec59e@arm.com/
>
> - While current PLZA only supports global assignment it may in future generations
> not require MSR_IA32_PQR_PLZA_ASSOC to be same on logical processors. resctrl
> thus needs to be flexible here.
> https://lore.kernel.org/lkml/fa45088b-1aea-468e-8253-3238e91f76c7@amd.com/
>
Good point.
> - No equivalent feature on RISC-V.
> https://lore.kernel.org/lkml/aYvP98xGoKPrDBCE@gen8/
>
> - Impact on context switch delay is a concern and unnecessary context switch delay should
> be avoided.
> https://lore.kernel.org/lkml/aZThTzdxVcBkLD7P@agluck-desk3/
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>
> - There is no requirement that a CLOSID/PARTID should be dedicated to kernel work.
> Specifically, same CLOSID/PARTID can be used for user space and kernel work.
> Also directly requested to not make kernel work CLOSID/PARTID exclusive:
> https://lore.kernel.org/lkml/c8268b2a-50d7-44b4-ac3f-5ce6624599b1@arm.com/
>
> - Only use case presented so far is related to memory bandwidth allocation where
> all kernel work is done unthrottled or equivalent to highest priority tasks while
> monitoring remains associated to task self.
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
> PLZA can support this with its global allocation (assuming RMID_EN=0 associates user
> RMID with kernel work). To support this use case MPAM would need to be able to
> change both PARTID and PMG:
> https://lore.kernel.org/lkml/845587f3-4c27-46d9-83f8-6b38ccc54183@arm.com/
>
> - Motivation of this work is to run kernel work with more/all/unthrottled
> resources to avoid priority inversions. We need to be careful with such
> generalization since not all resource allocations are alike yet a CLOSID/PARTID
> assignment applies to all resources. For example, user may designate a cache
> portion for high priority user space work and then needs to choose which cache
> portions the kernel may allocate into.
> https://lore.kernel.org/lkml/6293c484-ee54-46a2-b11c-e1e3c736e578@arm.com/
> - If all kernel work is done using the same allocation/CLOSID/PARTID then user
> needs to decide whether the kernel work's cache allocation overlaps the high
> priority tasks or not. To avoid evicting high priority task work it may be
> simplest for kernel allocation to not overlap high priority work but kernel work
> done on behalf of high priority work would then risk eviction by low priority
> work.
> - When considering cache allocation it seems more flexible to have high priority
> work keep its cache allocation when entering the kernel? This implies more than
> one CLOSID/PARTID may need to be used for kernel work.
>
>
> TBD
> ===
> - What is impact of different controls (for example the upcoming MAX) when tasks are
> spread across multiple control groups?
> https://lore.kernel.org/lkml/aY3bvKeOcZ9yG686@e134344.arm.com/
>
> How can MPAM support the "monitor kernel work with user space work" use case?
> =============================================================================
> This considers how MPAM could support the use case presented in:
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>
> To support this use case in MPAM the control group that dictates the allocations
> used in kernel work has to have monitor group(s) where this usage is tracked and user
> space would need to sum the kernel and user space usage. The number of PMG may vary
> and resctrl cannot assume that the kernel control group would have sufficient monitor
> groups to map 1:1 with user space control and monitor groups. Mapping user space
> control and monitor groups to kernel monitor groups thus seems best to be done by
> user space.
>
> Some examples:
> Consider allocation and monitoring setup for user space work:
> /sys/fs/resctrl <= User space default allocations
> /sys/fs/resctrl/g1 <= User space allocations g1
> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
> /sys/fs/resctrl/g2 <= User space allocations g2
> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>
> Having a single control group for kernel work and a system that supports
> 7 PMG per PARTID makes it possible to have a monitoring group for each user space
> monitoring group:
> (will go more into how such assignments can be made later)
>
> /sys/fs/resctrl/kernel <= Kernel space allocations
> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring default group
> /sys/fs/resctrl/kernel/mon_groups/kernel_g1 <= Kernel space monitoring group g1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g1m1 <= Kernel space monitoring group g1m1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g1m2 <= Kernel space monitoring group g1m2
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring group g2
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2m1 <= Kernel space monitoring group g2m1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2m2 <= Kernel space monitoring group g2m2
>
> With a configuration as above user space can sum the monitoring events of the user space
> groups and associated kernel space groups to obtain counts of all work done on behalf of
> associated tasks.
>
> It may not be possible to have such a 1:1 relationship and user space would have to
> arrange groups to match its usage. For example if system only supports two PMG per PARTID
> then user space may find it best to track monitoring as below:
> /sys/fs/resctrl/kernel <= Kernel space allocations
> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
>
>
> Requirements
> ============
> Based on understanding of what PLZA and MPAM is (and could be) capable of while considering the
> use case presented thus far it seems that resctrl has to:
> - support global assignment of resource group for kernel work
> - support per-resource group assignment for kernel work
>
Yes. That is correct.
> How can resctrl support the requirements?
> =========================================
>
> New global resctrl fs files
> ===========================
> info/kernel_mode (always visible)
> info/kernel_mode_assignment (visibility and content depends on active setting in info/kernel_mode)
It is probably a good idea to drop "assign" for this work; we already
have mbm_assign mode and related work. Maybe info/kernel_mode_assoc or
info/kernel_mode_association? Or we can wait and rename it appropriately
later.
>
> info/kernel_mode
> ================
> - Displays the currently active as well as possible features available to user
> space.
> - Single place where user can query "kernel mode" behavior and capabilities of the
> system.
> - Some possible values:
> - inherit_ctrl_and_mon <=== previously named "match_user", just renamed for consistency with other names
> When active, kernel and user space use the same CLOSID/RMID. The current status
> quo for x86.
> - global_assign_ctrl_inherit_mon
> When active, CLOSID/control group can be assigned for *all* (hence, "global")
> kernel work while all kernel work uses same RMID as user space.
> Can only be supported on architecture where CLOSID and RMID are independent.
> An arch may support this in hardware (RMID_EN=0?) or this can be done by resctrl during
> context switch if the RMID is independent and the context switches cost is
> considered "reasonable".
> This supports use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
> for PLZA.
> - global_assign_ctrl_assign_mon
> When active the same resource group (CLOSID and RMID) can be assigned to
> *all* kernel work. This could be any group, including the default group.
> There may not be a use case for this but it could be useful as an intermediate
> step toward the mode that follows (more later).
> - per_group_assign_ctrl_assign_mon
> When active every resource group can be associated with another (or the same)
> resource group. This association maps the resource group for user space work
> to resource group for kernel work. This is similar to the "kernel_group" idea
> presented in:
> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
> This addresses use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
> for MPAM.
All these new names and related information will go in a global structure.
Something like this:

	struct kern_mode {
		enum assoc_mode mode;
		struct rdtgroup *k_rdtgrp;
		...
	};

Not sure what other information will be required here; I will know once I
start working on it. This structure will be updated based on what the user
echoes into "kernel_mode" and "kernel_mode_assignment".
> - Additional values can be added as new requirements arise, for example "per_task"
> assignment. Connecting visibility of info/kernel_mode_assignment to mode in
> info/kernel_mode enables resctrl to later support additional modes that may require
> different configuration files, potentially per-resource group like the "tasks_kernel"
> (or perhaps rather "kernel_mode_tasks" to have consistent prefix for this feature)
> and "cpus_kernel" ("kernel_mode_cpus"?) discussed in these threads.
So, the per-resource-group files "kernel_mode_tasks" and "kernel_mode_cpus"
are not required right now. Correct?
>
> User can view active and supported modes:
>
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
> global_assign_ctrl_inherit_mon
> global_assign_ctrl_assign_mon
>
> User can switch modes:
> # echo global_assign_ctrl_inherit_mon > kernel_mode
> # cat kernel_mode
> inherit_ctrl_and_mon
> [global_assign_ctrl_inherit_mon]
> global_assign_ctrl_assign_mon
>
>
> info/kernel_mode_assignment
> ===========================
> - Visibility depends on active mode in info/kernel_mode.
> - Content depends on active mode in info/kernel_mode
> - Syntax to identify resource groups can use the syntax created as part of earlier ABMC work
> that supports default group https://lore.kernel.org/lkml/cover.1737577229.git.babu.moger@amd.com/
> - Default CTRL_MON group and if relevant, the default MON group, can be the default
> assignment when user just changes the kernel_mode without setting the assignment.
>
> info/kernel_mode_assignment when mode is global_assign_ctrl_inherit_mon
> -----------------------------------------------------------------------
> - info/kernel_mode_assignment contains single value that is the name of the control group
> used for all kernel work.
> - CLOSID/PARTID used for kernel work is determined from the control group assigned
> - default value is default CTRL_MON group
> - no monitor group assignment, kernel work inherits user space RMID
> - syntax is
> <CTRL_MON group> with "/" meaning default.
>
> info/kernel_mode_assignment when mode is global_assign_ctrl_assign_mon
> -----------------------------------------------------------------------
> - info/kernel_mode_assignment contains single value that is the name of the resource group
> used for all kernel work.
> - Combined CLOSID/RMID or combined PARTID/PMG is set globally to be associated with all
> kernel work.
> - default value is default CTRL_MON group
> - syntax is
> <CTRL_MON group>/<MON group>/ with "//" meaning default control and default monitoring group.
>
> info/kernel_mode_assignment when mode is per_group_assign_ctrl_assign_mon
> -------------------------------------------------------------------------
> - this presents the information proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
> within a single file for convenience and potential optimization when user space needs to make changes.
> Interface proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/ is also an option
> and as an alternative a per-resource group "kernel_group" can be made visible when user space enables
> this mode.
> - info/kernel_mode_assignment contains a mapping of every resource group to another resource group:
> <resource group for user space work>:<resource group for kernel work>
> - all resource groups must be present in first field of this file
> - Even though this is a "per group" setting, the expectation is that this will set the
> kernel work CLOSID/RMID for every task. This implies that writing to this file would need
> to take the tasklist_lock which, when held for too long, may impact other parts of the system.
> See https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
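A sketch of how user space might parse the proposed per-line
`<user group>:<kernel group>` format (the file and its contents here are
hypothetical, taken from the examples in this discussion; a here-document
stands in for info/kernel_mode_assignment):

```shell
# Sketch: split each mapping line on the first ':' into the user-facing
# resource group and the resource group used for its kernel work.
# "//" denotes the default control and default monitoring group.
while IFS=: read -r user kern; do
    printf 'user group "%s" -> kernel group "%s"\n' "$user" "$kern"
done <<'EOF'
//:kernel//
g1//:kernel//
g1/g1m1/:kernel//
g2//:kernel/kernel_g2/
EOF
```

Group paths contain "/" but never ":", so splitting on the first colon is
unambiguous.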
This mode is currently not supported in the AMD PLZA implementation, but
we have to keep the option open as a future enhancement for MPAM. I am
still learning about the MPAM requirements.
>
> Scenarios supported
> ===================
>
> Default
> -------
> For x86 I understand kernel work and user work to be done with the same CLOSID/RMID, which
> implies that info/kernel_mode can always be visible and at least display:
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
>
> info/kernel_mode_assignment is not visible in this mode.
>
> I understand MPAM may have different defaults here so would like to understand better.
>
> Dedicated global allocations for kernel work, monitoring same for user space and kernel (PLZA)
> ----------------------------------------------------------------------------------------------
> Possible scenario with PLZA, not MPAM (see later):
> 1. Create group(s) to manage allocations associated with user space work
> and assign tasks/CPUs to these groups.
> 2. Create group to manage allocations associated with all kernel work.
> - For example,
> # mkdir /sys/fs/resctrl/unthrottled
> - No constraints from resctrl fs on interactions with files in this group. From resctrl
> fs perspective it is not "dedicated" to kernel work but just another resource group.
That is correct. We don't need to treat the group specially for
kernel_mode when creating it. However, some handling will be required
when the kernel_mode group is deleted: we need to move the tasks/CPUs
back to the default group and update the global kernel_mode structure.
> User space can still assign tasks/CPUs to this group, which will result in this group
> being used for both kernel and user space control and monitoring. If user space wants
> to dedicate a group to kernel work then they should not assign tasks/CPUs to it.
> 3. Set kernel mode to global_assign_ctrl_inherit_mon:
> # echo global_assign_ctrl_inherit_mon > info/kernel_mode
> - info/kernel_mode_assignment becomes visible and contains "/" to indicate that default
> resource group is used for all kernel work
> - Sets the "global" CLOSID to be used for kernel work to 0, no setting of global RMID.
> 4. Set control group to be used for all kernel work:
> # echo unthrottled > info/kernel_mode_assignment
> - Sets the "global" CLOSID to be used for kernel work to CLOSID associated with
> CTRL_MON group named "unthrottled", no change to global RMID.
>
Ok. Sounds good.
>
> Dedicated global allocations and monitoring for kernel work
> -----------------------------------------------------------
> - Step 1 and 2 could be the same as above.
> OR
> 2b. If there is an "unthrottled" control group that is used for both user space and kernel
> allocations a separate MON group can be used to track monitoring data for kernel work.
> - For example,
> # mkdir /sys/fs/resctrl/unthrottled <=== All high priority work, kernel and user space
> # mkdir /sys/fs/resctrl/unthrottled/mon_groups/kernel_unthrottled <= Just monitor kernel work
>
> 3. Set kernel mode to global_assign_ctrl_assign_mon:
> # echo global_assign_ctrl_assign_mon > info/kernel_mode
> - info/kernel_mode_assignment becomes visible and contains "//" - default CTRL_MON is
> used for all kernel work allocations and monitoring
> - Sets both the "global" CLOSID and RMID to be used for kernel work to 0.
> 4. Set control group to be used for all kernel work:
> # echo unthrottled/kernel_unthrottled > info/kernel_mode_assignment
> - Sets the "global" CLOSID to be used for kernel work to CLOSID associated with
> CTRL_MON group named "unthrottled" and RMID used for kernel work to RMID
> associated with the child MON group named "kernel_unthrottled" within the "unthrottled" group.
>
ok. Sounds good.
> Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
> ----------------------------------------------------------------------------------------------
> 1. User space creates resource and monitoring groups for user tasks:
> /sys/fs/resctrl <= User space default allocations
> /sys/fs/resctrl/g1 <= User space allocations g1
> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
> /sys/fs/resctrl/g2 <= User space allocations g2
> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>
> 2. User space creates resource and monitoring groups for kernel work (system has two PMG):
> /sys/fs/resctrl/kernel <= Kernel space allocations
> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
> 3. Set kernel mode to per_group_assign_ctrl_assign_mon:
> # echo per_group_assign_ctrl_assign_mon > info/kernel_mode
> - info/kernel_mode_assignment becomes visible and contains
> # cat info/kernel_mode_assignment
> //://
> g1//://
> g1/g1m1/://
> g1/g1m2/://
> g2//://
> g2/g2m1/://
> g2/g2m2/://
> - An optimization here may be to have the change to per_group_assign_ctrl_assign_mon mode be implemented
> similar to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
> avoid keeping tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to default just for
> user space to likely change it.
> 4. Set groups to be used for kernel work:
> # echo '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
>
Currently, this is not supported in AMD's PLZA implementation, but we
need to keep this option open for MPAM.
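As an aside on the multi-line write to info/kernel_mode_assignment proposed above: a plain echo '...' does not expand `\n` escapes in most shells (printf or echo -e would be needed). Here is a hedged Python sketch that builds the same payload; the group names come from the example and the "<group>:<kernel group>" line syntax is still a proposal, not a shipped interface:

```python
# Illustrative sketch only: builds the multi-line payload proposed above for
# info/kernel_mode_assignment. Group names come from the example; the line
# syntax "<group>:<kernel group>" is a proposal, not a shipped interface.
assignments = [
    ("//", "kernel//"),
    ("g1//", "kernel//"),
    ("g1/g1m1/", "kernel//"),
    ("g1/g1m2/", "kernel//"),
    ("g2//", "kernel/kernel_g2/"),
    ("g2/g2m1/", "kernel/kernel_g2/"),
    ("g2/g2m2/", "kernel/kernel_g2/"),
]
payload = "".join(f"{group}:{kernel}\n" for group, kernel in assignments)

# Writing the expanded payload directly avoids relying on the shell to
# interpret '\n' escapes (plain 'echo' without -e does not):
# with open("/sys/fs/resctrl/info/kernel_mode_assignment", "w") as f:
#     f.write(payload)
```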
> The interfaces proposed aim to maintain compatibility with existing user space tools while
> adding support for all requirements expressed thus far in an efficient way. For an existing
> user space tool there is no change in meaning of any existing file and no existing known
> resource group files are made to disappear. There is a global configuration that lets user space
> manage allocations without needing to check and configure each control group, even per-resource
> group allocations can be managed from user space with a single read/write to support
> making changes in most efficient way.
>
> What do you think?
>
I will start planning this work. Feel free to add more details.
I Will have more questions as I start working on it.
I will separate GMBA work from this work.
Will send both series separately.
Thanks for details and summary.
Thanks
Babu
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-20 10:07 ` Ben Horgan
2026-02-20 18:39 ` Reinette Chatre
@ 2026-02-21 0:12 ` Moger, Babu
1 sibling, 0 replies; 114+ messages in thread
From: Moger, Babu @ 2026-02-21 0:12 UTC (permalink / raw)
To: Ben Horgan, Reinette Chatre, Babu Moger, corbet, tony.luck,
Dave.Martin, james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Ben,
On 2/20/2026 4:07 AM, Ben Horgan wrote:
> Hi Reinette, Babu,
>
> On 2/12/26 03:51, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 2/11/26 1:18 PM, Babu Moger wrote:
>>> On 2/11/26 10:54, Reinette Chatre wrote:
>>>> On 2/10/26 5:07 PM, Moger, Babu wrote:
>>>>> On 2/9/2026 12:44 PM, Reinette Chatre wrote:
>>>>>> On 1/21/26 1:12 PM, Babu Moger wrote:
>>>>>>> On AMD systems, the existing MBA feature allows the user to set a bandwidth
>>>>>>> limit for each QOS domain. However, multiple QOS domains share system
>>>>>>> memory bandwidth as a resource. In order to ensure that system memory
>>>>>>> bandwidth is not over-utilized, user must statically partition the
>>>>>>> available system bandwidth between the active QOS domains. This typically
>>>>>> How do you define "active" QoS Domain?
>>>>> Some domains may not have any CPUs associated with that CLOSID. By "active", I'm referring to domains that have CPUs assigned to the CLOSID.
>>>> To confirm, is this then specific to assigning CPUs to resource groups via
>>>> the cpus/cpus_list files? This refers to how a user needs to partition
>>>> available bandwidth so I am still trying to understand the message here since
>>>> users still need to do this even when CPUs are not assigned to resource
>>>> groups.
>>>>
>>> It is not specific to CPU assignment. It applies to task assignment also.
>>>
>>> For example: We have 4 domains;
>>>
>>> # cat schemata
>>> MB:0=8192;1=8192;2=8192;3=8192
>>>
>>> If this group has CPUs assigned to only the first two domains, then the group has only two active domains and we will only update those two. The MB values in the other domains do not matter.
>>
>> I see, thank you. As I understand an "active QoS domain" is something only user
>> space can designate. It may be possible for resctrl to get a sense of which QoS domains
>> are "active" when only CPUs are assigned to a resource group but when it comes to task
>> assignment it is user space that controls where tasks belonging to a group can be
>> scheduled and thus which QoS domains are "active" or not.
>>
>>>
>>> #echo "MB:0=8;1=8" > schemata
>>>
>>> # cat schemata
>>> MB:0=8;1=8;2=8192;3=8192
>>>
>>> The combined bandwidth can go up to 16 (8+8) units. Each unit is 1/8 GB.
>>>
>>> With GMBA, we can set the combined limit at a higher level and the total bandwidth will not exceed the GMBA limit.
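To make the "active domains" arithmetic above concrete, here is an illustrative Python sketch; the schemata syntax is taken from the example, and which domains count as active (here 0 and 1) is an assumption made by user space:

```python
# Illustrative sketch: parse the MB line shown above and total the
# bandwidth of the "active" domains. Which domains are active (here 0
# and 1) is an assumption user space makes; MB units are 1/8 GB on the
# systems discussed in this thread.
def parse_mb(line):
    resource, _, settings = line.partition(":")
    assert resource == "MB"
    return {int(dom): int(val)
            for dom, val in (entry.split("=") for entry in settings.split(";"))}

mb = parse_mb("MB:0=8;1=8;2=8192;3=8192")
active = [0, 1]                        # domains with CPUs/tasks assigned
units = sum(mb[d] for d in active)     # combined units of active domains
gbps = units / 8                       # combined ceiling in GB/s
```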
>>
>> Thank you for the confirmation.
>>
>>>
>>>>>>> results in system memory being under-utilized since not all QOS domains are
>>>>>>> using their full bandwidth allocation.
>>>>>>>
>>>>>>> AMD PQoS Global Bandwidth Enforcement (GLBE) provides a mechanism
>>>>>>> for software to specify bandwidth limits for groups of threads that span
>>>>>>> multiple QoS Domains. This collection of QoS domains is referred to as the
>>>>>>> GLBE control domain. The GLBE ceiling sets a maximum limit on memory bandwidth
>>>>>>> in the GLBE control domain. Bandwidth is shared by all threads in a Class of
>>>>>>> Service (COS) across every QoS domain managed by the GLBE control domain.
>>>>>> How does this bandwidth allocation limit impact existing MBA? For example, if a
>>>>>> system has two domains (A and B) that user space separately sets MBA
>>>>>> allocations for while also placing both domains within a "GLBE control domain"
>>>>>> with a different allocation, does the individual MBA allocations still matter?
>>>>> Yes. Both ceilings are enforced at their respective levels.
>>>>> The MBA ceiling is applied at the QoS domain level.
>>>>> The GLBE ceiling is applied at the GLBE control domain level.
>>>>> If the MBA ceiling exceeds the GLBE ceiling, the effective MBA limit will be capped by the GLBE ceiling.
>>>> It sounds as though MBA and GMBA/GLBE operate within the same parameters wrt
>>>> the limits but in examples in this series they have different limits. For example,
>>>> in the documentation patch [1] there is this:
>>>>
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=4096;1=4096;2=4096;3=4096
>>>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>>>
>>>> followed up with what it will look like in new generation [2]:
>>>>
>>>> GMB:0=4096;1=4096;2=4096;3=4096
>>>> MB:0=8192;1=8192;2=8192;3=8192
>>>> L3:0=ffff;1=ffff;2=ffff;3=ffff
>>>>
>>>> In both examples the per-domain MB ceiling is higher than the global GMB ceiling. With
>>>> above showing defaults and you state "If the MBA ceiling exceeds the GLBE ceiling,
>>>> the effective MBA limit will be capped by the GLBE ceiling." - does this mean that
>>>> MB ceiling can never be higher than GMB ceiling as shown in the examples?
>>>
>>> That is correct. There is one more piece of information here: the MB unit is 1/8 GB and the GMB unit is 1 GB. I have added that to the documentation in patch 4.
>>
>> ah - right. I did not take the different units into account.
>>
>>>
>>> The GMB limit defaults to the max value 4096 (bit 12 set) when a new group is created, meaning the GMB limit does not apply by default.
>>>
>>> When setting the limits, the same value should be set in all the domains of the GMB control domain. Having a different value in each domain results in unexpected behavior.
>>>
>>>>
>>>> Another question, when setting aside possible differences between MB and GMB.
>>>>
>>>> I am trying to understand how user may expect to interact with these interfaces ...
>>>>
>>>> Consider the starting state example as below where the MB and GMB ceilings are the
>>>> same:
>>>>
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>
>>>> Would something like below be accurate? Specifically, showing how the GMB limit impacts the
>>>> MB limit:
>>>> # echo "GMB:0=8;2=8" > schemata
>>>> # cat schemata
>>>> GMB:0=8;1=2048;2=8;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>>
>>> Yes. That is correct. It will cap the MB setting to 8. Note that we are setting aside the unit differences to keep the example simple.
>>
>> Thank you for confirming.
>>
>>>
>>>
>>>> ... and then when user space resets GMB the MB can reset like ...
>>>>
>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=2048;1=2048;2=2048;3=2048
>>>>
>>>> if I understand correctly this will only apply if the MB limit was never set so
>>>> another scenario may be to keep a previous MB setting after a GMB change:
>>>>
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>>>
>>>> # echo "GMB:0=8;2=8" > schemata
>>>> # cat schemata
>>>> GMB:0=8;1=2048;2=8;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>>>
>>>> # echo "GMB:0=2048;2=2048" > schemata
>>>> # cat schemata
>>>> GMB:0=2048;1=2048;2=2048;3=2048
>>>> MB:0=8;1=2048;2=8;3=2048
>>>>
>>>> What would be most intuitive way for user to interact with the interfaces?
>>>
>>> I see that you are trying to display the effective behaviors above.
>>
>> Indeed. My goal is to get an idea how user space may interact with the new interfaces and
>> what would be a reasonable expectation from resctrl be during these interactions.
>>
>>>
>>> Please keep in mind that MB and GMB units differ. I recommend showing only the values the user has explicitly configured, rather than the effective settings, as displaying both may cause confusion.
>>
>> hmmm ... this may be subjective. Could you please elaborate how presenting the effective
>> settings may cause confusion?
>>
>>>
>>> We also need to track the previous settings so we can revert to the earlier value when needed. The best approach is to document this behavior clearly.
>>
>> Yes, this will require resctrl to maintain more state.
>>
>> Documenting behavior is an option but I think we should first consider if there are things
>> resctrl can do to make the interface intuitive to use.
>>
>>>>>> From the description it sounds as though there is a new "memory bandwidth
>>>>>> ceiling/limit" that seems to imply that MBA allocations are limited by
>>>>>> GMBA allocations while the proposed user interface presents them as independent.
>>>>>>
>>>>>> If there is indeed some dependency here ... while MBA and GMBA CLOSID are
>>>>>> enumerated separately, under which scenario will GMBA and MBA support different
>>>>>> CLOSID? As I mentioned in [1] from user space perspective "memory bandwidth"
>>>>> I can see the following scenarios where MBA and GMBA can operate independently:
>>>>> 1. If the GMBA limit is set to ‘unlimited’, then MBA functions as an independent CLOS.
>>>>> 2. If the MBA limit is set to ‘unlimited’, then GMBA functions as an independent CLOS.
>>>>> I hope this clarifies your question.
>>>> No. When enumerating the features the number of CLOSID supported by each is
>>>> enumerated separately. That means GMBA and MBA may support different number of CLOSID.
>>>> My question is: "under which scenario will GMBA and MBA support different CLOSID?"
>>> No. There is no such scenario.
>>>>
>>>> Because of a possible difference in number of CLOSIDs it seems the feature supports possible
>>>> scenarios where some resource groups can support global AND per-domain limits while other
>>>> resource groups can just support global or just support per-domain limits. Is this correct?
>>>
>>> The system can support up to 16 CLOSIDs. All of them support all the features: LLC, MB, GMB, SMBA. Yes, we have separate enumeration for each feature. Are you suggesting we change that?
>>
>> It is not a concern to have different CLOSIDs between resources that are actually different,
>> for example, having LLC or MB support different number of CLOSIDs. Having the possibility to
>> allocate the *same* resource (memory bandwidth) with varying number of CLOSIDs does present a
>> challenge though. Would it be possible to have a snippet in the spec that explicitly states
>> that MB and GMB will always enumerate with the same number of CLOSIDs?
>>
>> Please see below where I will try to support this request more clearly and you can decide if
>> it is reasonable.
>>
>>>>>> can be seen as a single "resource" that can be allocated differently based on
>>>>>> the various schemata associated with that resource. This currently has a
>>>>>> dependency on the various schemata supporting the same number of CLOSID which
>>>>>> may be something that we can reconsider?
>>>>> After reviewing the new proposal again, I’m still unsure how all the pieces will fit together. MBA and GMBA share the same scope and have inter-dependencies. Without the full implementation details, it’s difficult for me to provide meaningful feedback on the new approach.
>>>> The new approach is not final so please provide feedback to help improve it so
>>>> that the features you are enabling can be supported well.
>>>
>>> Yes, I am trying. I noticed that the proposal appears to affect how the schemata information is displayed (in the info directory). It seems to introduce additional resource information. I don't see any harm in displaying it if it benefits certain architectures.
>>
>> It benefits all architectures.
>>
>> There are two parts to the current proposals.
>>
>> Part 1: Generic schema description
>> I believe there is consensus on this approach. This is actually something that is long
>> overdue and would have been great to have with the initial AMD
>> enabling. With the generic schema description forming part of resctrl the user can learn
>> from resctrl how to interact with the schemata file instead of relying on external information
>> and documentation.
>>
>> For example, on an Intel system that uses percentage based proportional allocation for memory
>> bandwidth the new resctrl files will display:
>> info/MB/resource_schemata/MB/type:scalar linear
>> info/MB/resource_schemata/MB/unit:all
>> info/MB/resource_schemata/MB/scale:1
>> info/MB/resource_schemata/MB/resolution:100
>> info/MB/resource_schemata/MB/tolerance:0
>> info/MB/resource_schemata/MB/max:100
>> info/MB/resource_schemata/MB/min:10
>>
>>
>> On an AMD system that uses absolute allocation with 1/8 GBps steps the files will display:
>> info/MB/resource_schemata/MB/type:scalar linear
>> info/MB/resource_schemata/MB/unit:GBps
>> info/MB/resource_schemata/MB/scale:1
>> info/MB/resource_schemata/MB/resolution:8
>> info/MB/resource_schemata/MB/tolerance:0
>> info/MB/resource_schemata/MB/max:2048
>> info/MB/resource_schemata/MB/min:1
>>
>> Having such an interface will be helpful today. Users do not need to first figure out
>> whether they are on an AMD or Intel system, and then read the docs to learn the AMD units,
>> before interacting with resctrl. resctrl will be the generic interface it intends to be.
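As an illustration of how user space could consume the proposed per-schema property files, here is a hedged Python sketch; the paths and file names follow the proposal above and are not a shipped interface:

```python
# Illustrative sketch of how a tool could read the proposed
# info/<resource>/resource_schemata/<schema>/ property files. The file
# names and example values follow the proposal in this thread; nothing
# here is a shipped interface.
import os

def read_schema_props(path):
    """Return a dict of property-name -> value for one schema directory."""
    props = {}
    for name in os.listdir(path):
        with open(os.path.join(path, name)) as f:
            props[name] = f.read().strip()
    return props

# Hypothetical usage:
# props = read_schema_props("/sys/fs/resctrl/info/MB/resource_schemata/MB")
# if props.get("unit") == "GBps" and props.get("resolution") == "8":
#     pass  # AMD-style absolute allocation in 1/8 GBps steps
```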
>>
>> Part 2: Supporting multiple controls for a single resource
>> This is a new feature on which there also appears to be consensus; it is needed by MPAM and
>> Intel RDT, where it is possible to use different controls for the same resource. For example,
>> there can be a minimum and maximum control associated with the memory bandwidth resource.
>>
>> For example,
>> info/
>> └─ MB/
>> └─ resource_schemata/
>> ├─ MB/
>> ├─ MB_MIN/
>> ├─ MB_MAX/
>> ┆
>>
>>
>> Here is where the big question comes in for GLBE - is this actually a new resource
>> for which resctrl needs to add interfaces to manage its allocation, or is it instead
>> an additional control associated with the existing memory bandwidth resource?
>>
>> For me things are actually pointing to GLBE not being a new resource but instead being
>> a new control for the existing memory bandwidth resource.
>>
>> I understand that for a PoC it is simplest to add support for GLBE as a new resource, as is
>> done in this series, but treating it as an actual unique resource does not seem
>> appropriate since resctrl already has a "memory bandwidth" resource. User space expects
>> to find all the resources that it can allocate in info/ - I do not think it is correct
>> to have two separate directories/resources for memory bandwidth here.
>>
>> What if, instead, it looks something like:
>>
>> info/
>> └── MB/
>> └── resource_schemata/
>> ├── GMB/
>> │ ├── max:4096
>> │ ├── min:1
>> │ ├── resolution:1
>> │ ├── scale:1
>> │ ├── tolerance:0
>> │ ├── type:scalar linear
>> │ └── unit:GBps
>> └── MB/
>> ├── max:8192
>> ├── min:1
>> ├── resolution:8
>> ├── scale:1
>> ├── tolerance:0
>> ├── type:scalar linear
>> └── unit:GBps
>>
>> With an interface like above GMB is just another control/schema used to allocate the
>> existing memory bandwidth resource. With the planned files it is possible to express the
>> different maximums and units used by the MB and GMB schema. Users no longer need to
>> dig for the unit information in the docs, it is available in the interface.
>>
>> Doing something like this does depend on GLBE supporting the same number of CLOSIDs
>> as MB, which seems to be how this will be implemented. If there is indeed a confirmation
>> of this from AMD architecture then we can do something like this in resctrl.
>
> I haven't fully understood what GLBE is but in MPAM we have an optional
> feature in MSC (MPAM devices) called partid narrowing. For some MSC
> there are limited controls and the incoming partid is mapped to an
> effective partid using a mapping. This mapping is software controllable.
> Dave (with Shaopeng and Zeng) has a proposal that uses this to repurpose partid
> bits as pmg bits, [1]. This usage would have to be opt-in as it changes
> the number of closid/rmid that MPAM presents to resctrl. If however, the
> user doesn't use that scheme then the controls could be presented as
> controls for groups of closid in resctrl. Is this similar/usable with
> the same interface as GLBE or have I misunderstood?
GLBE is specific to AMD and addresses a limitation with memory
bandwidth (MB) allocation.
I don't see any similarities between these features.
Thanks
Babu
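As a reference for the MB/GMB interaction discussed upthread, here is an illustrative Python sketch of the capping rule. Units are as stated in this thread (MB in 1/8 GB steps, GMB in 1 GB steps, 4096 meaning unlimited); treating the cap as a simple min() per domain is a simplification, since the real GMB ceiling is shared across all domains in the GLBE control domain:

```python
# Illustrative sketch only. Units per this thread: MB in 1/8 GB steps,
# GMB in 1 GB steps; a GMB value of 4096 (bit 12 set) means "no limit".
GMB_UNLIMITED = 4096

def effective_ceiling_gbps(mb_units, gmb_units):
    """Effective bandwidth ceiling in GB/s for one domain, treating the
    GMB cap as a per-domain min() - a simplification, since the real GMB
    ceiling is competitively shared across the whole control domain."""
    mb_gbps = mb_units / 8
    if gmb_units >= GMB_UNLIMITED:
        return mb_gbps
    return min(mb_gbps, gmb_units)
```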
>
> [1]
> https://lore.kernel.org/linux-arm-kernel/20241212154000.330467-1-Dave.Martin@arm.com/
>
>>
>> There is a "part 3" to the proposals that attempts to address the new requirement where
>> some of the controls allocate at a different scope while also requiring monitoring at
>> that new scope. After learning more about GLBE this does not seem relevant to GLBE but is
>> something to return to for the "MPAM CPU-less" work. We could already prepare for this
>> by adding the new "scope" schema property though.
>>
>>
>> Reinette
>>
>>>
>>> Thanks
>>>
>>> Babu
>>>
>>>
>>>>
>>>> Reinette
>>>>
>>>> [1] https://lore.kernel.org/lkml/d58f70592a4ce89e744e7378e49d5a36be3fd05e.1769029977.git.babu.moger@amd.com/
>>>> [2] https://lore.kernel.org/lkml/e0c79c53-489d-47bf-89b9-f1bb709316c6@amd.com/
>>>>
>>
>>
>
>
> Thanks,
>
> Ben
>
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-20 18:39 ` Reinette Chatre
@ 2026-02-23 9:29 ` Ben Horgan
0 siblings, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-23 9:29 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, Moger, Babu, corbet, tony.luck,
Dave.Martin, james.morse, tglx, mingo, bp, dave.hansen
Cc: x86, hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, akpm, pawan.kumar.gupta,
pmladek, feng.tang, kees, arnd, fvdl, lirongqing, bhelgaas,
seanjc, xin, manali.shukla, dapeng1.mi, chang.seok.bae,
mario.limonciello, naveen, elena.reshetova, thomas.lendacky,
linux-doc, linux-kernel, kvm, peternewman, eranian,
gautham.shenoy
Hi Reinette,
On 2/20/26 18:39, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/20/26 2:07 AM, Ben Horgan wrote:
>>
>> I haven't fully understood what GLBE is but in MPAM we have an optional
>> feature in MSC (MPAM devices) called partid narrowing. For some MSC
>> there are limited controls and the incoming partid is mapped to an
>> effective partid using a mapping. This mapping is software controllable.
>> Dave (with Shaopeng and Zeng) has a proposal that uses this to repurpose partid
>> bits as pmg bits, [1]. This usage would have to be opt-in as it changes
>> the number of closid/rmid that MPAM presents to resctrl. If however, the
>> user doesn't use that scheme then the controls could be presented as
>> controls for groups of closid in resctrl. Is this similar/usable with
>> the same interface as GLBE or have I misunderstood?
>>
>> [1]
>> https://lore.kernel.org/linux-arm-kernel/20241212154000.330467-1-Dave.Martin@arm.com/
>
> On a high level these look like different capabilities to me but I look forward to
> hear from others to understand where I may be wrong.
>
> As I understand the feature you refer to is a way in which MPAM can increase the
> number of hardware monitoring IDs available(*). It does so by using the PARTID
> narrowing feature while taking advantage of the fact that PARTID for filtering
> resource monitors is always a "request PARTID". In itself I understand the PARTID
> narrowing feature to control how resource allocation of a *single* "MPAM component"
> is managed.
>
> On the other hand I see GLBE as a feature that essentially allows the scope of
> allocation to span multiple domains/components.
>
> As I see it, applying GLBE to MPAM would need the capability to, for example,
> set a memory bandwidth MAX that is shared across multiple MPAM components.
Thanks for the explanation. They do seem like orthogonal features. Sorry
for the noise.
>
> Reinette
>
> * as a sidenote it is not clear to me why this would require an opt-in since
> there seem to be only benefits to this.
On systems with a mix of intPARTID-capable and non-intPARTID-capable MSC
(with more partids), on the non-intPARTID-capable MSC you'll have to
use 2 partids as one and then cannot use max-per-partid controls on
that component. Also, when using CDP on the caches we may want to use
partid narrowing to hide it for memory allocation. However, it might be
sensible to make the monitoring-id increase the default for some shapes
of platform.
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-19 18:14 ` Reinette Chatre
@ 2026-02-23 9:48 ` Ben Horgan
0 siblings, 0 replies; 114+ messages in thread
From: Ben Horgan @ 2026-02-23 9:48 UTC (permalink / raw)
To: Reinette Chatre
Cc: Moger, Babu, Moger, Babu, Luck, Tony, Drew Fustini,
corbet@lwn.net, Dave.Martin@arm.com, james.morse@arm.com,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, eranian@google.com,
Shenoy, Gautham Ranjal
Hi Reinette,
On 2/19/26 18:14, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/19/26 2:21 AM, Ben Horgan wrote:
>> On 2/17/26 18:51, Reinette Chatre wrote:
>>> On 2/16/26 7:18 AM, Ben Horgan wrote:
>>>> On Thu, Feb 12, 2026 at 10:37:21AM -0800, Reinette Chatre wrote:
>>>>> On 2/12/26 5:55 AM, Ben Horgan wrote:
>>>>>> On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
>>>>>>> On 2/11/26 8:40 AM, Ben Horgan wrote:
>>>>>>>> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>>>
>>>>>>>>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>>>>>>>>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>>>>>>>>> instead of CPL0 using something like "kernel" or ... ?
>>>>>>>>
>>>>>>>> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
>>>>>>>> internally and here are a few thoughts.
>>>>>>>>
>>>>>>>> If the user case is just that an option run all tasks with the same closid/rmid
>>>>>>>> (partid/pmg) configuration when they are running in the kernel then I'd favour a
>>>>>>>> mount option. The resctrl filesytem interface doesn't need to change and
>>>>>>>
>>>>>>> I view mount options as an interface of last resort. Why would a mount option be needed
>>>>>>> in this case? The existence of the file used to configure the feature seems sufficient?
>>>>>>
>>>>>> If we are taking away a closid from the user then the number of CTRL_MON groups
>>>>>> that can be created changes. It seems reasonable for user-space to expect
>>>>>> num_closid to be a fixed value.
>>>>>
>>>>> I do not see why we need to take away a CLOSID from the user. Consider a user space that
>>>>
>>>> Yes, it is just slightly simpler to take away a CLOSID, but we could just go with the
>>>> default CLOSID also being used for the kernel. I would be ok with a file saying the
>>>> mode, like the mbm_event file does for counter assignment. It is slightly misleading
>>>> that a configuration file is under info, but necessary as we don't have another
>>>> location global to the resctrl mount.
>>>
>>> Indeed, the "info" directory has evolved more into a "config" directory.
>>>
>>>>> runs with just two resource groups, for example, "high priority" and "low priority", it seems
>>>>> reasonable to make it possible to let the "low priority" tasks run with "high priority"
>>>>> allocations when in kernel space without needing to dedicate a new CLOSID? More reasonable
>>>>> when only considering memory bandwidth allocation though.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Also ...
>>>>>>>
>>>>>>> I do not think resctrl should unnecessarily place constraints on what the hardware
>>>>>>> features are capable of. As I understand, both PLZA and MPAM supports use case where
>>>>>>> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
>>>>>>> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
>>>>>>> This may be because I am not familiar with all the requirements here so please do
>>>>>>> help with insight on how the hardware feature is intended to be used as it relates
>>>>>>> to its design.
>>>>>>>
>>>>>> We have to be very careful when constraining a feature this much. If resctrl does something
>>>>>> like this it essentially restricts what users could do forever.
>>>>>>
>>>>>> Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
>>>>>> fixed kernel CLOSID/RMID configuration option might just give all we need for
>>>>>> usecases we know we have and be minimally intrusive enough to not preclude a
>>>>>> more featureful PLZA later when new usecases come about.
>>>>>
>>>>> Having ability to grow features would be ideal. I do not see how a fixed kernel CLOSID/RMID
>>>>> configuration leaves room to build on top though. Could you please elaborate?
>>>>
>>>> If we initially go with a single new configuration file, e.g. kernel_mode, which
>>>> could be "match_user" or "use_root", this would be the only initial change to the
>>>> interface needed. If more usecases present themselves a new mode could be added,
>>>> e.g. "configurable", and an interface to actually change the rmid/closid for the
>>>> kernel could be added.
>>>
>>> Something like this could be a base to work from. I think only the two ("match_user" and
>>> "use_root") are a bit limiting for even the initial implementation though.
>>> As I understand, "use_root" implies using the allocations of the default group but
>>> does not indicate what MON group (which RMID/PMG) should be used to monitor the
>>> work done in kernel space. A way to specify the actual group may be needed?
>>
>> Yeah, I'm not sure that flexibility is strictly necessary but it will make
>> the interface easier to use.
>
> I find your proposal to be a good foundation to build on. I am in process of trying out
> some ideas around it for consideration and comparison to other ideas.
>
> ...
>
>>>>> existing "tasks" file does but only supports the same CLOSID/RMID for both user
>>>>> space and kernel space. To support the new hardware features where the CLOSID/RMID
>>>>> can be different we cannot just change "tasks" interface and would need to keep it
>>>>> backward compatible. So far I assumed that it would be ok for the "tasks" file
>>>>> to essentially get new meaning as the CLOSID/RMID for just user space work, which
>>>>> seems to require a second file for kernel space as a consequence? So far I have
>>>>> not seen an option that does not change meaning of the "tasks" file.
>>>>
>>>> Would it make sense to have some new type of entries in the tasks file,
>>>> e.g. k_ctrl_<pid>, k_mon_<pid> to say, in the kernel, use the closid of this
>>>> CTRL_MON for this task pid or use the rmid of this CTRL_MON/MON group for this task
>>>> pid? We would still probably need separate files for the cpu configuration.
>>>
>>> I am obligated to nack such a change to the tasks file since it would impact any
>>> existing user space parsing of this file.
>>>
>>
>> Good to know. Do you consider the format of the tasks file fully fixed?
>
> At this point I believe it is fully fixed, yes. For this we need to consider both
> how it is documented to be used and how it is used. For the former we of course have
> Documentation/filesystems/resctrl.rst but for the latter it becomes difficult.
>
> On the documentation side I also find existing documentation to be specific in how
> "tasks" file should be interpreted: "Reading this file shows the list of all tasks
> that belong to this group.". I do not find there to be a lot of room for changing
> interpretation here.
>
> An interface change as you suggest is reasonable for a file that is consumed by a
> human - somebody can read the file and immediately notice the change and it may even
> be intuitive. We know that there is a lot of tooling built around resctrl fs though
> so we should evaluate impact of any interface changes on such automation. Not all of this
> tooling is public so this is where things become difficult to predict the impact so
> we tend to be conservative in assumptions here.
>
> There is one open source resctrl fs tool, the "pqos" utility [1], that is getting a lot of
> usage and it could be a predictor (albeit not decider) of such interface change impact.
> A peek at how it parses the "tasks" file confirms that it only expects a number, see
> resctrl_alloc_task_read() at https://github.com/intel/intel-cmt-cat/blob/master/lib/resctrl_alloc.c#L437
> I thus expect that a user running pqos on a kernel that contains such a change to the
> "tasks" file will fail which confirms changing syntax of "tasks" file should be avoided.
Thanks for this information. This will help when thinking about how we
can add any further features.
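To make the compatibility concern concrete, here is a sketch of a strict, pqos-style reader of the "tasks" file (not the actual pqos code); a reader like this would reject any non-numeric entry such as the hypothetical k_ctrl_<pid> lines:

```python
def read_tasks(lines):
    """Sketch of a strict, pqos-style tasks-file reader: every non-empty
    line must be a bare PID. Not the actual pqos implementation."""
    pids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not line.isdigit():
            raise ValueError(f"unexpected tasks entry: {line!r}")
        pids.append(int(line))
    return pids

read_tasks(["1234", "5678"])         # parses fine
# read_tasks(["1234", "k_ctrl_99"])  # would raise ValueError
```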
>
> Reinette
>
> [1] https://github.com/intel/intel-cmt-cat
>
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-20 2:53 ` Reinette Chatre
2026-02-20 22:44 ` Moger, Babu
@ 2026-02-23 10:08 ` Ben Horgan
2026-02-23 16:38 ` Reinette Chatre
1 sibling, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-23 10:08 UTC (permalink / raw)
To: Reinette Chatre, Luck, Tony, Moger, Babu, eranian@google.com
Cc: Moger, Babu, Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Reinette,
On 2/20/26 02:53, Reinette Chatre wrote:
> Hi Tony, Ben, Babu, and Stephane,
>
> On 2/18/26 8:44 AM, Luck, Tony wrote:
>> On Tue, Feb 17, 2026 at 03:55:44PM -0800, Reinette Chatre wrote:
>>> Hi Tony,
>>>
>>> On 2/17/26 2:52 PM, Luck, Tony wrote:
>>>> On Tue, Feb 17, 2026 at 02:37:49PM -0800, Reinette Chatre wrote:
>>>>> Hi Tony,
>>>>>
>>>>> On 2/17/26 1:44 PM, Luck, Tony wrote:
>>>>>>>>>> I'm not sure if this would happen in the real world or not.
>>>>>>>>>
>>>>>>>>> Ack. I would like to echo Tony's request for feedback from resctrl users
>>>>>>>>> https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/
>>>>>>>>
>>>>>>>> Indeed. This is all getting a bit complicated.
>>>>>>>>
>>>>>>>
>>>>>>> ack
>>>>>>
>>>>>> We have several proposals so far:
>>>>>>
>>>>>> 1) Ben's suggestion to use the default group (either with a Babu-style
>>>>>> "plza" file just in that group, or a configuration file under "info/").
>>>>>>
>>>>>> This is easily the simplest for implementation, but has no flexibility.
>>>>>> Also requires users to move all the non-critical workloads out to other
>>>>>> CTRL_MON groups. Doesn't steal a CLOSID/RMID.
>>>>>>
>>>>>> 2) My thoughts are for a separate group that is only used to configure
>>>>>> the schemata. This does allocate a dedicated CLOSID/RMID pair. Those
>>>>>> are used for all tasks when in kernel mode.
>>>>>>
>>>>>> No context switch overhead. Has some flexibility.
>>>>>>
>>>>>> 3) Babu's RFC patch. Designates an existing CTRL_MON group as the one
>>>>>> that defines kernel CLOSID/RMID. Tasks and CPUs can be assigned to this
>>>>>> group in addition to belonging to another group that defines schemata
>>>>>> resources when running in non-kernel mode.
>>>>>> Tasks aren't required to be in the kernel group, in which case they
>>>>>> keep the same CLOSID in both user and kernel mode. When used in this
>>>>>> way there will be context switch overhead when changing between tasks
>>>>>> with different kernel CLOSID/RMID.
>>>>>>
>>>>>> 4) Even more complex scenarios with more than one user configurable
>>>>>> kernel group to give more options on resources available in the kernel.
>>>>>>
>>>>>>
>>>>>> I had a quick pass at coding my option "2". My UI to designate the
>>>>>> group to use for kernel mode is to reserve the name "kernel_group"
>>>>>> when making CTRL_MON groups. Some tweaks avoid creating the
>>>>>> "tasks", "cpus", and "cpus_list" files (which might be done more
>>>>>> elegantly), and the "mon_groups" directory in this group.
>>>>>
>>>>> Should the decision of whether context switch overhead is acceptable
>>>>> not be left up to the user?
>>>>
>>>> When someone comes up with a convincing use case to support one set of
>>>> kernel resources when interrupting task A, and a different set of
>>>> resources when interrupting task B, we should certainly listen.
>>>
>>> Absolutely. Someone can come up with such a use case at any time though. This
>>> could be, and as has happened with some other resctrl interfaces, likely will be
>>> after this feature has been supported for a few kernel versions. What timeline
>>> should we give which users to share their use cases with us? Even if we do hear
>>> from some users will that guarantee that no such use case will arise in the
>>> future? Such predictions of usage are difficult for me and I thus find it simpler
>>> to think of flexible ways to enable the features that we know the hardware supports.
>>>
>>> This does not mean that a full featured solution needs to be implemented from day 1.
>>> If folks believe there are "no valid use cases" today resctrl still needs to prepare for
>>> how it can grow to support full hardware capability and hardware designs in the
>>> future.
>>>
>>> Please also consider not just resources for kernel work but also monitoring for
>>> kernel work. I do think, for example, a reasonable use case may be to determine
>>> how much memory bandwidth the kernel uses on behalf of certain tasks.
>>>
>>>>> I assume that, just like what is currently done for x86's MSR_IA32_PQR_ASSOC,
>>>>> the needed registers will only be updated if there is a new CLOSID/RMID needed
>>>>> for kernel space.
>>>>
>>>> Babu's RFC does this.
>>>
>>> Right.
>>>
>>>>
>>>>> Are you suggesting that just this checking itself is too
>>>>> expensive to justify giving user space more flexibility by fully enabling what
>>>>> the hardware supports? If resctrl does draw such a line to not enable what
>>>>> hardware supports it should be well justified.
>>>>
>>>> The check is likely lightweight (as long as the variables to be
>>>> compared reside in the same cache lines as the existing CLOSID
>>>> and RMID checks). So if there is a use case for different resources
>>>> when in kernel mode, then taking this path will be fine.
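The lazy-update pattern being discussed can be sketched as follows. This is an illustrative model of a PQR_ASSOC-style check (the CpuState class and its names are hypothetical, not kernel code): the expensive register write is skipped whenever the incoming task needs the same CLOSID/RMID the CPU already has.

```python
class CpuState:
    """Toy model of per-CPU CLOSID/RMID state with lazy MSR updates.

    Hypothetical sketch: the comparison on every context switch is
    cheap; the (emulated) MSR write is the costly part and only
    happens when the values actually change.
    """

    def __init__(self):
        self.cur_closid = 0
        self.cur_rmid = 0
        self.msr_writes = 0  # counts the expensive operations

    def sched_in(self, closid, rmid):
        # Skip the write when nothing changed, as the real
        # context-switch path does for MSR_IA32_PQR_ASSOC.
        if (closid, rmid) != (self.cur_closid, self.cur_rmid):
            self.cur_closid, self.cur_rmid = closid, rmid
            self.msr_writes += 1
```

Switching between tasks that share a CLOSID/RMID then costs only the compare, which is the basis for the "check is likely lightweight" argument above.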
>>>
>>> Why limit this to knowing about a use case? As I understand this feature can be
>>> supported in a flexible way without introducing additional context switch overhead
>>> if the user prefers to use just one allocation for all kernel work. By being
>>> configurable and allowing resctrl to support more use cases in the future resctrl
>>> does not paint itself into a corner. This allows resctrl to grow support so that
>>> the user can use all capabilities of the hardware with understanding that it will
>>> increase context switch time.
>>>
>>> Reinette
>>
>> How about this idea for extensibility.
>>
>> Rename Babu's "plza" file to "plza_mode". Instead of just being an
>> on/off switch, it may accept multiple possible requests.
>>
>> Humorous version:
>>
>> # echo "babu" > plza_mode
>>
>> This results in behavior of Babu's RFC. The CLOSID and RMID assigned to
>> the CTRL_MON group are used when in kernel mode, but only for tasks that
>> have their task-id written to the "tasks" file, or for tasks in the
>> default group running on CPUs assigned to this group via the "cpus" or
>> "cpus_list" files.
>>
>> # echo "tony" > plza_mode
>>
>> All tasks run with the CLOSID/RMID for this group. The "tasks", "cpus" and
>> "cpus_list" files and the "mon_groups" directory are removed.
>>
>> # echo "ben" > plza_mode
>>
>> Only usable in the top-level default CTRL_MON directory. CLOSID=0/RMID=0
>> are used for all tasks in kernel mode.
>>
>> # echo "stephane" > plza_mode
>>
>> The RMID for this group is freed. All tasks run in kernel mode with the
>> CLOSID for this group, but use the same RMID for both user and kernel.
>> In addition to files removed in "tony" mode, the mon_data directory is
>> removed.
>>
>> # echo "some-future-name" > plza_mode
>>
>> Somebody has a new use case. Resctrl can be extended by allowing some
>> new mode.
>>
>>
>> Likely real implementation:
>>
>> Sub-components of each of the ideas above are encoded as a bitmask that
>> is written to plza_mode. There is a file in the info/ directory listing
>> which bits are supported on the current system (e.g. the "keep the same
>> RMID" mode may be impractical on ARM, so it would not be listed as an
>> option.)
>
> I like the idea of a global file that indicates what is supported on the
> system. I find this to match Ben's proposal of a "kernel_mode" file in
> info/ that looks to be a good foundation to build on. Ben also reiterated support
> for this in
> https://lore.kernel.org/lkml/feaa16a5-765c-4c24-9e0b-c1f4ef87a66f@arm.com/
>
> As I mentioned in https://lore.kernel.org/lkml/5c19536b-aca0-42ce-a9d5-211fbbdbb485@intel.com/
> the suggestions surrounding the per-resource group "plza_mode" file
> are unexpected since they ignore earlier comments about impact on user space.
> Specifically, this proposal does not address:
> https://lore.kernel.org/lkml/aY3bvKeOcZ9yG686@e134344.arm.com/
> https://lore.kernel.org/lkml/c779ce82-4d8a-4943-b7ec-643e5a345d6c@arm.com/
>
> Below I aim to summarize the discussions as they relate to constraints and
> requirements. I intended to capture all that has been mentioned in these
> discussions so far, so if I missed something it was not intentional;
> please point it out to help make this summary complete.
>
> I hope that by starting with this we can at least agree on what
> resctrl needs to support and how user space could interact with resctrl
> to meet requirements.
>
> After the summary of what resctrl needs to support I aim to combine
> capabilities from the various proposals to meet the constraints and
> requirements as I understand them so far. This aims to build on all that
> has been shared until now.
>
> Any comments are appreciated.
>
> Summary of considerations surrounding CLOSID/RMID (PARTID/PMG) assignment for kernel work
> =========================================================================================
>
> - PLZA currently only supports global assignment (only PLZA_EN of
> MSR_IA32_PQR_PLZA_ASSOC may differ on logical processors). Even so, current
> speculation is that RMID_EN=0 implies that user space RMID is used to monitor
> kernel work, which could appear to the user as "kernel mode" supporting multiple RMIDs.
> https://lore.kernel.org/lkml/abb049fa-3a3d-4601-9ae3-61eeb7fd8fcf@amd.com/
>
> - MPAM can set unique PARTID and PMG on every logical processor.
> https://lore.kernel.org/lkml/fd7e0779-7e29-461d-adb6-0568a81ec59e@arm.com/
>
> - While current PLZA only supports global assignment it may in future generations
> not require MSR_IA32_PQR_PLZA_ASSOC to be same on logical processors. resctrl
> thus needs to be flexible here.
> https://lore.kernel.org/lkml/fa45088b-1aea-468e-8253-3238e91f76c7@amd.com/
>
> - No equivalent feature on RISC-V.
> https://lore.kernel.org/lkml/aYvP98xGoKPrDBCE@gen8/
>
> - Impact on context switch delay is a concern and unnecessary context switch delay should
> be avoided.
> https://lore.kernel.org/lkml/aZThTzdxVcBkLD7P@agluck-desk3/
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>
> - There is no requirement that a CLOSID/PARTID should be dedicated to kernel work.
> Specifically, same CLOSID/PARTID can be used for user space and kernel work.
> Also directly requested to not make kernel work CLOSID/PARTID exclusive:
> https://lore.kernel.org/lkml/c8268b2a-50d7-44b4-ac3f-5ce6624599b1@arm.com/
>
> - Only use case presented so far is related to memory bandwidth allocation where
> all kernel work is done unthrottled or equivalent to highest priority tasks while
> monitoring remains associated to task self.
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
> PLZA can support this with its global allocation (assuming RMID_EN=0 associates user
> RMID with kernel work). To support this use case MPAM would need to be able to
> change both PARTID and PMG:
> https://lore.kernel.org/lkml/845587f3-4c27-46d9-83f8-6b38ccc54183@arm.com/
>
> - Motivation of this work is to run kernel work with more/all/unthrottled
> resources to avoid priority inversions. We need to be careful with such
> generalization since not all resource allocations are alike yet a CLOSID/PARTID
> assignment applies to all resources. For example, user may designate a cache
> portion for high priority user space work and then needs to choose which cache
> portions the kernel may allocate into.
> https://lore.kernel.org/lkml/6293c484-ee54-46a2-b11c-e1e3c736e578@arm.com/
> - If all kernel work is done using the same allocation/CLOSID/PARTID then user
> needs to decide whether the kernel work's cache allocation overlaps the high
> priority tasks or not. To avoid evicting high priority task work it may be
> simplest for kernel allocation to not overlap high priority work but kernel work
> done on behalf of high priority work would then risk eviction by low priority
> work.
> - When considering cache allocation it seems more flexible to have high priority
> work keep its cache allocation when entering the kernel? This implies more than
> one CLOSID/PARTID may need to be used for kernel work.
>
>
> TBD
> ===
> - What is impact of different controls (for example the upcoming MAX) when tasks are
> spread across multiple control groups?
> https://lore.kernel.org/lkml/aY3bvKeOcZ9yG686@e134344.arm.com/
>
> How can MPAM support the "monitor kernel work with user space work" use case?
> =============================================================================
> This considers how MPAM could support the use case presented in:
> https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>
> To support this use case in MPAM the control group that dictates the allocations
> used in kernel work has to have monitor group(s) where this usage is tracked and user
> space would need to sum the kernel and user space usage. The number of PMG may vary
> and resctrl cannot assume that the kernel control group would have sufficient monitor
> groups to map 1:1 with user space control and monitor groups. Mapping user space
> control and monitor groups to kernel monitor groups thus seems best to be done by
> user space.
>
> Some examples:
> Consider allocation and monitoring setup for user space work:
> /sys/fs/resctrl <= User space default allocations
> /sys/fs/resctrl/g1 <= User space allocations g1
> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
> /sys/fs/resctrl/g2 <= User space allocations g2
> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>
> Having a single control group for kernel work and a system that supports
> 7 PMG per PARTID makes it possible to have a monitoring group for each user space
> monitoring group:
> (will go more into how such assignments can be made later)
>
> /sys/fs/resctrl/kernel <= Kernel space allocations
> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring default group
> /sys/fs/resctrl/kernel/mon_groups/kernel_g1 <= Kernel space monitoring group g1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g1m1 <= Kernel space monitoring group g1m1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g1m2 <= Kernel space monitoring group g1m2
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring group g2
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2m1 <= Kernel space monitoring group g2m1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2m2 <= Kernel space monitoring group g2m2
>
> With a configuration as above user space can sum the monitoring events of the user space
> groups and associated kernel space groups to obtain counts of all work done on behalf of
> associated tasks.
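The summing described above could be done by user space along these lines. This is a sketch with hypothetical names; in practice the counts would be read from the respective mon_data event files:

```python
def total_usage(user_counts, kernel_counts, mapping):
    """Sum user-space and kernel-side monitoring counts per user group.

    Hypothetical helper: user_counts and kernel_counts map group name
    to an event count (as read from each group's mon_data files);
    mapping maps each user monitoring group to the kernel monitoring
    group that tracks kernel work done on its behalf.
    """
    return {
        group: count + kernel_counts[mapping[group]]
        for group, count in user_counts.items()
    }
```

With the 1:1 layout above, user space would map g1m1 to kernel_g1m1 and so on; with fewer PMG it would map several user groups onto one kernel monitoring group.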
>
> It may not be possible to have such a 1:1 relationship and user space would have to
> arrange groups to match its usage. For example if system only supports two PMG per PARTID
> then user space may find it best to track monitoring as below:
> /sys/fs/resctrl/kernel <= Kernel space allocations
> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
>
>
> Requirements
> ============
> Based on understanding of what PLZA and MPAM are (and could be) capable of while considering the
> use case presented thus far it seems that resctrl has to:
> - support global assignment of resource group for kernel work
> - support per-resource group assignment for kernel work
>
> How can resctrl support the requirements?
> =========================================
>
> New global resctrl fs files
> ===========================
> info/kernel_mode (always visible)
> info/kernel_mode_assignment (visibility and content depends on active setting in info/kernel_mode)
>
> info/kernel_mode
> ================
> - Displays the currently active as well as possible features available to user
> space.
> - Single place where user can query "kernel mode" behavior and capabilities of the
> system.
> - Some possible values:
> - inherit_ctrl_and_mon <=== previously named "match_user", just renamed for consistency with other names
> When active, kernel and user space use the same CLOSID/RMID. The current status
> quo for x86.
> - global_assign_ctrl_inherit_mon
> When active, CLOSID/control group can be assigned for *all* (hence, "global")
> kernel work while all kernel work uses same RMID as user space.
> Can only be supported on architecture where CLOSID and RMID are independent.
> An arch may support this in hardware (RMID_EN=0?) or this can be done by resctrl during
> context switch if the RMID is independent and the context switch cost is
> considered "reasonable".
> This supports use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
> for PLZA.
> - global_assign_ctrl_assign_mon
> When active the same resource group (CLOSID and RMID) can be assigned to
> *all* kernel work. This could be any group, including the default group.
> There may not be a use case for this but it could be useful as an intermediate
> step toward the mode that follows (more later).
> - per_group_assign_ctrl_assign_mon
> When active every resource group can be associated with another (or the same)
> resource group. This association maps the resource group for user space work
> to resource group for kernel work. This is similar to the "kernel_group" idea
> presented in:
> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
> This addresses use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
> for MPAM.
> - Additional values can be added as new requirements arise, for example "per_task"
> assignment. Connecting visibility of info/kernel_mode_assignment to mode in
> info/kernel_mode enables resctrl to later support additional modes that may require
> different configuration files, potentially per-resource group like the "tasks_kernel"
> (or perhaps rather "kernel_mode_tasks" to have consistent prefix for this feature)
> and "cpus_kernel" ("kernel_mode_cpus"?) discussed in these threads.
>
> User can view active and supported modes:
>
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
> global_assign_ctrl_inherit_mon
> global_assign_ctrl_assign_mon
>
> User can switch modes:
> # echo global_assign_ctrl_inherit_mon > info/kernel_mode
> # cat info/kernel_mode
> inherit_ctrl_and_mon
> [global_assign_ctrl_inherit_mon]
> global_assign_ctrl_assign_mon
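The bracketed-active-value convention shown above follows existing resctrl files. User space could parse such a listing along these lines (a sketch; the kernel_mode file itself is only a proposal in this thread):

```python
def parse_mode_file(text):
    """Parse a resctrl-style mode listing where the active value is
    shown in square brackets, e.g.:

        [inherit_ctrl_and_mon]
        global_assign_ctrl_inherit_mon

    Returns (active_mode, supported_modes). Sketch of one possible
    parse of the proposed file, not an existing kernel interface.
    """
    active = None
    supported = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            active = line[1:-1]  # strip the brackets marking the active mode
            supported.append(active)
        else:
            supported.append(line)
    return active, supported
```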
>
>
> info/kernel_mode_assignment
> ===========================
> - Visibility depends on active mode in info/kernel_mode.
> - Content depends on active mode in info/kernel_mode
> - Syntax to identify resource groups can use the syntax created as part of earlier ABMC work
> that supports default group https://lore.kernel.org/lkml/cover.1737577229.git.babu.moger@amd.com/
> - Default CTRL_MON group and if relevant, the default MON group, can be the default
> assignment when user just changes the kernel_mode without setting the assignment.
>
> info/kernel_mode_assignment when mode is global_assign_ctrl_inherit_mon
> -----------------------------------------------------------------------
> - info/kernel_mode_assignment contains single value that is the name of the control group
> used for all kernel work.
> - CLOSID/PARTID used for kernel work is determined from the control group assigned
> - default value is default CTRL_MON group
> - no monitor group assignment, kernel work inherits user space RMID
> - syntax is
> <CTRL_MON group> with "/" meaning default.
>
> info/kernel_mode_assignment when mode is global_assign_ctrl_assign_mon
> -----------------------------------------------------------------------
> - info/kernel_mode_assignment contains single value that is the name of the resource group
> used for all kernel work.
> - Combined CLOSID/RMID or combined PARTID/PMG is set globally to be associated with all
> kernel work.
> - default value is default CTRL_MON group
> - syntax is
> <CTRL_MON group>/<MON group>/ with "//" meaning default control and default monitoring group.
>
> info/kernel_mode_assignment when mode is per_group_assign_ctrl_assign_mon
> -------------------------------------------------------------------------
> - this presents the information proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
> within a single file for convenience and potential optimization when user space needs to make changes.
> Interface proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/ is also an option
> and as an alternative a per-resource group "kernel_group" can be made visible when user space enables
> this mode.
> - info/kernel_mode_assignment contains a mapping of every resource group to another resource group:
> <resource group for user space work>:<resource group for kernel work>
> - all resource groups must be present in first field of this file
> - Even though this is a "per group" setting, the expectation is that this will set the
> kernel work CLOSID/RMID for every task. This implies that writing to this file would need
> to take the tasklist_lock which, when held for too long, may impact other parts of the system.
> See https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
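The per-group mapping format proposed here could be parsed by user space along these lines. This is a sketch of one possible parse of the proposed "<user group>:<kernel group>" lines, not an existing interface:

```python
def parse_assignment(text):
    """Parse the proposed kernel_mode_assignment mapping format.

    Each line is "<resource group for user work>:<resource group for
    kernel work>", where "//" names the default control and default
    monitoring group. The format is still under discussion; this is
    an illustrative sketch only.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only on the first ":" since group paths contain "/".
        user_group, kernel_group = line.split(":", 1)
        mapping[user_group] = kernel_group
    return mapping
```

A single read/write of such a file would let user space inspect or update all mappings at once, which is part of the efficiency argument made below.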
>
> Scenarios supported
> ===================
>
> Default
> -------
> For x86 I understand kernel work and user work to be done with same CLOSID/RMID which
> implies that info/kernel_mode can always be visible and at least display:
> # cat info/kernel_mode
> [inherit_ctrl_and_mon]
>
> info/kernel_mode_assignment is not visible in this mode.
>
> I understand MPAM may have different defaults here so would like to understand better.
>
> Dedicated global allocations for kernel work, monitoring same for user space and kernel (PLZA)
> ----------------------------------------------------------------------------------------------
> Possible scenario with PLZA, not MPAM (see later):
> 1. Create group(s) to manage allocations associated with user space work
> and assign tasks/CPUs to these groups.
> 2. Create group to manage allocations associated with all kernel work.
> - For example,
> # mkdir /sys/fs/resctrl/unthrottled
> - No constraints from resctrl fs on interactions with files in this group. From resctrl
> fs perspective it is not "dedicated" to kernel work but just another resource group.
> User space can still assign tasks/CPUs to this group that will result in this group
> to be used for both kernel and user space control and monitoring. If user space wants
> to dedicate a group to kernel work then they should not assign tasks/CPUs to it.
> 3. Set kernel mode to global_assign_ctrl_inherit_mon:
> # echo global_assign_ctrl_inherit_mon > info/kernel_mode
> - info/kernel_mode_assignment becomes visible and contains "/" to indicate that default
> resource group is used for all kernel work
> - Sets the "global" CLOSID to be used for kernel work to 0, no setting of global RMID.
> 4. Set control group to be used for all kernel work:
> # echo unthrottled > info/kernel_mode_assignment
> - Sets the "global" CLOSID to be used for kernel work to CLOSID associated with
> CTRL_MON group named "unthrottled", no change to global RMID.
>
>
> Dedicated global allocations and monitoring for kernel work
> -----------------------------------------------------------
> - Step 1 and 2 could be the same as above.
> OR
> 2b. If there is an "unthrottled" control group that is used for both user space and kernel
> allocations a separate MON group can be used to track monitoring data for kernel work.
> - For example,
> # mkdir /sys/fs/resctrl/unthrottled <=== All high priority work, kernel and user space
> # mkdir /sys/fs/resctrl/unthrottled/mon_groups/kernel_unthrottled <= Just monitor kernel work
>
> 3. Set kernel mode to global_assign_ctrl_assign_mon:
> # echo global_assign_ctrl_assign_mon > info/kernel_mode
> - info/kernel_mode_assignment becomes visible and contains "//" - default CTRL_MON is
> used for all kernel work allocations and monitoring
> - Sets both the "global" CLOSID and RMID to be used for kernel work to 0.
> 4. Set control group to be used for all kernel work:
> # echo unthrottled/kernel_unthrottled > info/kernel_mode_assignment
> - Sets the "global" CLOSID to be used for kernel work to CLOSID associated with
> CTRL_MON group named "unthrottled" and RMID used for kernel work to RMID
> associated with child MON group within "unthrottled" group named "kernel_unthrottled".
>
> Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
> ----------------------------------------------------------------------------------------------
> 1. User space creates resource and monitoring groups for user tasks:
> /sys/fs/resctrl <= User space default allocations
> /sys/fs/resctrl/g1 <= User space allocations g1
> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
> /sys/fs/resctrl/g2 <= User space allocations g2
> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>
> 2. User space creates resource and monitoring groups for kernel work (system has two PMG):
> /sys/fs/resctrl/kernel <= Kernel space allocations
> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
> 3. Set kernel mode to per_group_assign_ctrl_assign_mon:
> # echo per_group_assign_ctrl_assign_mon > info/kernel_mode
> - info/kernel_mode_assignment becomes visible and contains
> # cat info/kernel_mode_assignment
> //://
> g1//://
> g1/g1m1/://
> g1/g1m2/://
> g2//://
> g2/g2m1/://
> g2/g2m2/://
> - An optimization here may be to have the change to per_group_assign_ctrl_assign_mon mode be implemented
> similar to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
> avoid keeping tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to default just for
> user space to likely change it.
> 4. Set groups to be used for kernel work:
> # printf '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
Am I right in thinking that you want this in the info directory to avoid
adding files to the CTRL_MON/MON groups?
> The interfaces proposed aim to maintain compatibility with existing user space tools while
> adding support for all requirements expressed thus far in an efficient way. For an existing
> user space tool there is no change in meaning of any existing file and no existing known
> resource group files are made to disappear. There is a global configuration that lets user space
> manage allocations without needing to check and configure each control group; even per-resource
> group allocations can be managed from user space with a single read/write, supporting
> changes in the most efficient way.
>
> What do you think?
Looks like a good and well-considered plan. Thank you in particular for
figuring out how MPAM fits in.
>
> Reinette
>
Thanks,
Ben
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-12 3:51 ` Reinette Chatre
2026-02-12 19:09 ` Babu Moger
2026-02-20 10:07 ` Ben Horgan
@ 2026-02-23 13:21 ` Fenghua Yu
2026-02-23 17:38 ` Reinette Chatre
2026-02-23 13:21 ` Fenghua Yu
3 siblings, 1 reply; 114+ messages in thread
From: Fenghua Yu @ 2026-02-23 13:21 UTC (permalink / raw)
To: reinette.chatre
Cc: Dave.Martin, akpm, arnd, babu.moger, bhelgaas, bmoger, bp,
bsegall, chang.seok.bae, corbet, dapeng1.mi, dave.hansen,
dietmar.eggemann, elena.reshetova, eranian, feng.tang, fvdl,
gautham.shenoy, hpa, james.morse, juri.lelli, kees, kvm,
linux-doc, linux-kernel, lirongqing, manali.shukla,
mario.limonciello, mgorman, mingo, naveen, pawan.kumar.gupta,
peternewman, peterz, pmladek, rostedt, seanjc, tglx,
thomas.lendacky, tony.luck, vincent.guittot, vschneid, x86, xin
Hi, Reinette,
> What if, instead, it looks something like:
>
>info/
>└── MB/
> └── resource_schemata/
> ├── GMB/
> │ ├── max:4096
> │ ├── min:1
> │ ├── resolution:1
> │ ├── scale:1
> │ ├── tolerance:0
> │ ├── type:scalar linear
> │ └── unit:GBps
> └── MB/
> ├── max:8192
> ├── min:1
> ├── resolution:8
> ├── scale:1
> ├── tolerance:0
> ├── type:scalar linear
> └── unit:GBps
May I make two comments?
1. This directory is for both info and control, right?
"info" is a read-only directory:
dr-xr-xr-x 8 root root 0 Feb 23 12:50 info
And its name suggests it's for info only as well.
Instead of mixing info and control together, would it be better to add a
new "control" or "config" directory in /sys/fs/resctrl for this control
and info purpose?
2. This control method seems to handle only global control for resources.
But what if a control is per domain and per closid/partid?
For example, MPAM has a hardlimit control per mem bandwidth allocation
domain per partid. When hardlimit is enabled, MPAM hardware enforces
a hard limit at the MBW max. This cannot be controlled globally.
For this kind of per-partid, per-domain control, I propose a
config_schemata/control_schemata file:
partition X/
control_schemata (or config_schemata):
MB_hardlimit: 0=0/1;1=0/1;...
Is this reasonable?
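To make the proposal concrete, a line of the suggested form could be parsed as below. Both the file and the format are only proposals in this thread; the "0=0/1" notation above denotes a per-domain boolean, so the sketch parses a concrete instance such as "MB_hardlimit:0=1;1=0":

```python
def parse_control_schemata(line):
    """Parse a per-domain control line of the proposed form
    "MB_hardlimit:0=1;1=0" (semicolon-separated domain_id=value pairs).

    Illustrative sketch of the proposed control_schemata/
    config_schemata format, not an existing resctrl interface.
    """
    name, _, body = line.partition(":")
    settings = {}
    for pair in body.strip().split(";"):
        if not pair:
            continue
        domain, value = pair.split("=")
        settings[int(domain)] = int(value)
    return name.strip(), settings
```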
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-23 10:08 ` Ben Horgan
@ 2026-02-23 16:38 ` Reinette Chatre
2026-02-24 9:36 ` Ben Horgan
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-23 16:38 UTC (permalink / raw)
To: Ben Horgan, Luck, Tony, Moger, Babu, eranian@google.com
Cc: Moger, Babu, Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Ben,
On 2/23/26 2:08 AM, Ben Horgan wrote:
> On 2/20/26 02:53, Reinette Chatre wrote:
...
>> Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
>> ----------------------------------------------------------------------------------------------
>> 1. User space creates resource and monitoring groups for user tasks:
>> /sys/fs/resctrl <= User space default allocations
>> /sys/fs/resctrl/g1 <= User space allocations g1
>> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
>> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
>> /sys/fs/resctrl/g2 <= User space allocations g2
>> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
>> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>>
>> 2. User space creates resource and monitoring groups for kernel work (system has two PMG):
>> /sys/fs/resctrl/kernel <= Kernel space allocations
>> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
>> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
>> 3. Set kernel mode to per_group_assign_ctrl_assign_mon:
>> # echo per_group_assign_ctrl_assign_mon > info/kernel_mode
>> - info/kernel_mode_assignment becomes visible and contains
>> # cat info/kernel_mode_assignment
>> //://
>> g1//://
>> g1/g1m1/://
>> g1/g1m2/://
>> g2//://
>> g2/g2m1/://
>> g2/g2m2/://
>> - An optimization here may be to have the change to per_group_assign_ctrl_assign_mon mode be implemented
>> similar to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
>> avoid keeping tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to default just for
>> user space to likely change it.
>> 4. Set groups to be used for kernel work:
>> # echo '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
>
> Am I right in thinking that you want this in the info directory to avoid
> adding files to the CTRL_MON/MON groups?
I see this file as providing the same capability as you suggested in
https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/. The reason why I
presented this as a single file is not because I am trying to avoid adding
files to the CTRL_MON/MON groups but because I believe such an interface gives
resctrl more flexibility and supports more scenarios for optimization.
As you mentioned in your proposal the solution enables a single write to move
a task. As I thought through what resctrl needs to do on such write I saw a lot
of similarities with mongrp_reparent() that loops through all the tasks via
for_each_process_thread() while holding tasklist_lock. Issues with mongrp_reparent()
holding tasklist_lock for a long time are described in [1].
While the single file does not avoid taking tasklist_lock it does give the user the
ability to set the kernel group for multiple user groups with a single write. When user space
does so I believe it is possible for resctrl to have an optimization that takes tasklist_lock
just once and makes changes to tasks belonging to all groups while looping through all tasks on
the system just once. With files within the CTRL_MON/MON groups, setting the kernel group for
multiple user groups will require multiple writes from user space where each write requires
looping through tasks while holding tasklist_lock during each loop. From what I learned
from [1] something like this can be very disruptive to the rest of the system.
In summary, I see the single file as providing the same capability as the
one-file-per-CTRL_MON/MON-group approach since the user can choose to set the kernel group for
user groups one at a time, but it also gives resctrl more flexibility for optimization.
Nothing is set in stone here. There is still flexibility in this proposal to support
PARTID and PMG assignment with a single file in each CTRL_MON/MON group if we find that
it has more benefits. resctrl can still expose a "per_group_assign_ctrl_assign_mon" mode
but instead of making "info/kernel_mode_assignment" visible when it is enabled, the control
files in CTRL_MON/MON groups are made visible ... even in this case resctrl could still add
the single file later if deemed necessary at that time.
Considering all this, do you think resctrl should rather start with a file in each
CTRL_MON/MON group?
Reinette
[1] https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
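[Editorial aside: the optimization described above, taking tasklist_lock once and
updating every task's kernel group in a single walk, can be sketched with a small
user-space simulation. Python stand-in for kernel code; Task,
apply_kernel_assignments and the group names are all hypothetical.]

```python
from dataclasses import dataclass

@dataclass
class Task:
    user_group: str
    kernel_group: str = "/"  # default group

def apply_kernel_assignments(tasks, mapping):
    """Single pass over all tasks (standing in for one
    for_each_process_thread() walk under tasklist_lock): every
    user-group -> kernel-group change is applied in one loop,
    instead of one full walk per modified group."""
    for t in tasks:  # one traversal, regardless of how many groups changed
        if t.user_group in mapping:
            t.kernel_group = mapping[t.user_group]

tasks = [Task("/"), Task("g1"), Task("g2"), Task("g2")]
# One write to the single file can express all changes at once:
apply_kernel_assignments(tasks, {"/": "kernel", "g1": "kernel",
                                 "g2": "kernel/kernel_g2"})
```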
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-20 22:44 ` Moger, Babu
@ 2026-02-23 17:12 ` Reinette Chatre
2026-02-23 22:35 ` Moger, Babu
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-23 17:12 UTC (permalink / raw)
To: Moger, Babu, Luck, Tony, Ben Horgan, Moger, Babu,
eranian@google.com
Cc: Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Babu,
On 2/20/26 2:44 PM, Moger, Babu wrote:
> On 2/19/2026 8:53 PM, Reinette Chatre wrote:
>> Summary of considerations surrounding CLOSID/RMID (PARTID/PMG) assignment for kernel work
>> =========================================================================================
>>
>> - PLZA currently only supports global assignment (only PLZA_EN of
>> MSR_IA32_PQR_PLZA_ASSOC may differ on logical processors). Even so, current
>> speculation is that RMID_EN=0 implies that user space RMID is used to monitor
>> kernel work that could appear to user as "kernel mode" supporting multiple RMIDs.
>> https://lore.kernel.org/lkml/abb049fa-3a3d-4601-9ae3-61eeb7fd8fcf@amd.com/
>
> Yes. RMID_EN=0 means don't use a separate RMID for PLZA.
Thank you very much for confirming.
...
>> How can resctrl support the requirements?
>> =========================================
>>
>> New global resctrl fs files
>> ===========================
>> info/kernel_mode (always visible)
>> info/kernel_mode_assignment (visibility and content depends on active setting in info/kernel_mode)
>
> Probably a good idea to drop "assign" for this work. We already have the mbm_assign mode and related work.
hmmm ... I think "assign" is generic enough of a word that
it cannot be claimed by a single feature.
> info/kernel_mode_assoc or info/kernel_mode_association? Or we can wait and rename it appropriately later.
yes, naming can be settled later.
>
>>
>> info/kernel_mode
>> ================
>> - Displays the currently active mode as well as the possible modes available to user
>> space.
>> - Single place where user can query "kernel mode" behavior and capabilities of the
>> system.
>> - Some possible values:
>> - inherit_ctrl_and_mon <=== previously named "match_user", just renamed for consistency with other names
>> When active, kernel and user space use the same CLOSID/RMID. The current status
>> quo for x86.
>> - global_assign_ctrl_inherit_mon
>> When active, CLOSID/control group can be assigned for *all* (hence, "global")
>> kernel work while all kernel work uses the same RMID as user space.
>> Can only be supported on architectures where CLOSID and RMID are independent.
>> An arch may support this in hardware (RMID_EN=0?) or this can be done by resctrl during
>> context switch if the RMID is independent and the context-switch cost is
>> considered "reasonable".
>> This supports use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>> for PLZA.
>> - global_assign_ctrl_assign_mon
>> When active the same resource group (CLOSID and RMID) can be assigned to
>> *all* kernel work. This could be any group, including the default group.
>> There may not be a use case for this but it could be useful as an intermediate
>> step toward the mode that follows (more later).
>> - per_group_assign_ctrl_assign_mon
>> When active every resource group can be associated with another (or the same)
>> resource group. This association maps the resource group for user space work
>> to resource group for kernel work. This is similar to the "kernel_group" idea
>> presented in:
>> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
>> This addresses use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>> for MPAM.
>
> All these new names and related information will go in global structure.
>
> Something like this..
>
> struct kern_mode {
>     enum assoc_mode mode;
>     struct rdtgroup *k_rdtgrp;
>     ...
> };
>
> Not sure what other information will be required here. Will know once I started working on it.
>
> This structure will be updated based on user writes to "kernel_mode" and "kernel_mode_assignment".
This looks to be a good start. I think keeping the rdtgroup association is good since
it helps to easily display the name to user space while also providing access to the CLOSID
and RMID that are assigned to the tasks.
Placing them in their own structure instead of in plain globals also makes it easier to
build on when some modes have different requirements with respect to rdtgroup management.
You may find that certain arrangements work better for supporting interactions with the
task structure that are not clear at this time.
>
>
>> - Additional values can be added as new requirements arise, for example "per_task"
>> assignment. Connecting visibility of info/kernel_mode_assignment to mode in
>> info/kernel_mode enables resctrl to later support additional modes that may require
>> different configuration files, potentially per-resource group like the "tasks_kernel"
>> (or perhaps rather "kernel_mode_tasks" to have consistent prefix for this feature)
>> and "cpus_kernel" ("kernel_mode_cpus"?) discussed in these threads.
>
> So, the per-resource-group files "kernel_mode_tasks" and "kernel_mode_cpus" are not required right now. Correct?
Correct. The way I see it the baseline implementation to support PLZA should be
straightforward. We'll probably spend a bit extra time on the supporting documentation
to pave the way for possible additions.
>> User can view active and supported modes:
>>
>> # cat info/kernel_mode
>> [inherit_ctrl_and_mon]
>> global_assign_ctrl_inherit_mon
>> global_assign_ctrl_assign_mon
>>
>> User can switch modes:
>> # echo global_assign_ctrl_inherit_mon > kernel_mode
>> # cat kernel_mode
>> inherit_ctrl_and_mon
>> [global_assign_ctrl_inherit_mon]
>> global_assign_ctrl_assign_mon
>>
>>
>> info/kernel_mode_assignment
>> ===========================
>> - Visibility depends on active mode in info/kernel_mode.
>> - Content depends on active mode in info/kernel_mode
>> - Syntax to identify resource groups can use the syntax created as part of earlier ABMC work
>> that supports default group https://lore.kernel.org/lkml/cover.1737577229.git.babu.moger@amd.com/
>> - Default CTRL_MON group and if relevant, the default MON group, can be the default
>> assignment when user just changes the kernel_mode without setting the assignment.
>>
>> info/kernel_mode_assignment when mode is global_assign_ctrl_inherit_mon
>> -----------------------------------------------------------------------
>> - info/kernel_mode_assignment contains single value that is the name of the control group
>> used for all kernel work.
>> - CLOSID/PARTID used for kernel work is determined from the control group assigned
>> - default value is default CTRL_MON group
>> - no monitor group assignment, kernel work inherits user space RMID
>> - syntax is
>> <CTRL_MON group> with "/" meaning default.
>>
>> info/kernel_mode_assignment when mode is global_assign_ctrl_assign_mon
>> -----------------------------------------------------------------------
>> - info/kernel_mode_assignment contains single value that is the name of the resource group
>> used for all kernel work.
>> - Combined CLOSID/RMID or combined PARTID/PMG is set globally to be associated with all
>> kernel work.
>> - default value is default CTRL_MON group
>> - syntax is
>> <CTRL_MON group>/<MON group>/ with "//" meaning default control and default monitoring group.
>>
>> info/kernel_mode_assignment when mode is per_group_assign_ctrl_assign_mon
>> -------------------------------------------------------------------------
>> - this presents the information proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
>> within a single file for convenience and potential optimization when user space needs to make changes.
>> Interface proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/ is also an option
>> and as an alternative a per-resource group "kernel_group" can be made visible when user space enables
>> this mode.
>> - info/kernel_mode_assignment contains a mapping of every resource group to another resource group:
>> <resource group for user space work>:<resource group for kernel work>
>> - all resource groups must be present in first field of this file
>> - Even though this is a "per group" setting, the expectation is that this will set the
>> kernel work CLOSID/RMID for every task. This implies that writing to this file would need
>> to take the tasklist_lock which, when held for too long, may impact other parts of the system.
>> See https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
>
> This mode is currently not supported in the AMD PLZA implementation. But we have to keep the options open for future MPAM enhancements. I am still learning the MPAM requirements.
>
>>
>> Scenarios supported
>> ===================
>>
>> Default
>> -------
>> For x86 I understand kernel work and user work to be done with same CLOSID/RMID which
>> implies that info/kernel_mode can always be visible and at least display:
>> # cat info/kernel_mode
>> [inherit_ctrl_and_mon]
>>
>> info/kernel_mode_assignment is not visible in this mode.
>>
>> I understand MPAM may have different defaults here so would like to understand better.
>>
>> Dedicated global allocations for kernel work, monitoring same for user space and kernel (PLZA)
>> ----------------------------------------------------------------------------------------------
>> Possible scenario with PLZA, not MPAM (see later):
>> 1. Create group(s) to manage allocations associated with user space work
>> and assign tasks/CPUs to these groups.
>> 2. Create group to manage allocations associated with all kernel work.
>> - For example,
>> # mkdir /sys/fs/resctrl/unthrottled
>> - No constraints from resctrl fs on interactions with files in this group. From resctrl
>> fs perspective it is not "dedicated" to kernel work but just another resource group.
>
> That is correct. We don't need to handle the group specially for kernel_mode while creating the group. However, there will be some handling required when the kernel_mode group is deleted. We need to move the tasks/CPUs back to the default group and update the global kernel_mode structure.
Good point, yes.
...
>> Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
>> ----------------------------------------------------------------------------------------------
>> 1. User space creates resource and monitoring groups for user tasks:
>> /sys/fs/resctrl <= User space default allocations
>> /sys/fs/resctrl/g1 <= User space allocations g1
>> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
>> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
>> /sys/fs/resctrl/g2 <= User space allocations g2
>> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
>> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>>
>> 2. User space creates resource and monitoring groups for kernel work (system has two PMG):
>> /sys/fs/resctrl/kernel <= Kernel space allocations
>> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
>> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
>> 3. Set kernel mode to per_group_assign_ctrl_assign_mon:
>> # echo per_group_assign_ctrl_assign_mon > info/kernel_mode
>> - info/kernel_mode_assignment becomes visible and contains
>> # cat info/kernel_mode_assignment
>> //://
>> g1//://
>> g1/g1m1/://
>> g1/g1m2/://
>> g2//://
>> g2/g2m1/://
>> g2/g2m2/://
>> - An optimization here may be to have the change to per_group_assign_ctrl_assign_mon mode be implemented
>> similar to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
>> avoid keeping tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to default just for
>> user space to likely change it.
>> 4. Set groups to be used for kernel work:
>> # echo '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
>>
>
> Currently, this is not supported in AMD's PLZA implementation. But we need to keep this option open for MPAM.
Right. I expect PLZA to at least support "global_assign_ctrl_inherit_mon" mode
since that is the one we know somebody is waiting for. I am not actually sure about
"global_assign_ctrl_assign_mon" for PLZA. It is the variant intended to be implemented
by this RFC submission and does not seem difficult to implement but I have not really heard
any requests around it. Please do correct me if I missed anything here.
>
>> The interfaces proposed aim to maintain compatibility with existing user space tools while
>> adding support for all requirements expressed thus far in an efficient way. For an existing
>> user space tool there is no change in meaning of any existing file and no existing known
>> resource group files are made to disappear. There is a global configuration that lets user space
>> manage allocations without needing to check and configure each control group; even
>> per-resource-group allocations can be managed from user space with a single read/write,
>> supporting changes in the most efficient way.
>>
>> What do you think?
>>
>
> I will start planning this work. Feel free to add more details.
> I will have more questions as I start working on it.
>
> I will separate GMBA work from this work.
>
> Will send both series separately.
>
> Thanks for details and summary.
>
Thank you very much.
Reinette
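[Editorial aside: the assignment syntax used in step 4 above,
'<user group>:<kernel group>' per line with "//" meaning the default groups, could
be parsed roughly as follows. An illustrative Python sketch, not the kernel
implementation; the function names are hypothetical.]

```python
def parse_group_path(path):
    """Split 'CTRL_MON/MON/' into (ctrl, mon); empty components
    denote the default group, so '//' -> ('', '')."""
    parts = path.split("/")
    ctrl = parts[0]
    mon = parts[1] if len(parts) > 1 else ""
    return ctrl, mon

def parse_assignment(text):
    """Parse kernel_mode_assignment content in the
    per_group_assign_ctrl_assign_mon mode: one
    '<user group>:<kernel group>' mapping per line."""
    mapping = {}
    for line in text.strip().splitlines():
        user, _, kernel = line.partition(":")
        mapping[parse_group_path(user)] = parse_group_path(kernel)
    return mapping

# Default group maps to itself; g2/g2m1 maps to kernel/kernel_g2:
m = parse_assignment("//://\ng2/g2m1/:kernel/kernel_g2/")
```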
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE)
2026-02-23 13:21 ` Fenghua Yu
@ 2026-02-23 17:38 ` Reinette Chatre
0 siblings, 0 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-23 17:38 UTC (permalink / raw)
To: Fenghua Yu
Cc: Dave.Martin, akpm, arnd, babu.moger, bhelgaas, bmoger, bp,
bsegall, chang.seok.bae, corbet, dapeng1.mi, dave.hansen,
dietmar.eggemann, elena.reshetova, eranian, feng.tang, fvdl,
gautham.shenoy, hpa, james.morse, juri.lelli, kees, kvm,
linux-doc, linux-kernel, lirongqing, manali.shukla,
mario.limonciello, mgorman, mingo, naveen, pawan.kumar.gupta,
peternewman, peterz, pmladek, rostedt, seanjc, tglx,
thomas.lendacky, tony.luck, vincent.guittot, vschneid, x86, xin
Hi Fenghua,
On 2/23/26 5:21 AM, Fenghua Yu wrote:
> Hi, Reinette,
>
>> What if, instead, it looks something like:
>>
>>info/
>>└── MB/
>> └── resource_schemata/
>> ├── GMB/
>> │ ├── max:4096
>> │ ├── min:1
>> │ ├── resolution:1
>> │ ├── scale:1
>> │ ├── tolerance:0
>> │ ├── type:scalar linear
>> │ └── unit:GBps
>> └── MB/
>> ├── max:8192
>> ├── min:1
>> ├── resolution:8
>> ├── scale:1
>> ├── tolerance:0
>> ├── type:scalar linear
>> └── unit:GBps
>
> May I offer two comments?
Your comments are always welcome and appreciated.
>
> 1. This directory is for both info and control, right?
Right.
>
> "info" is a read-only directory:
> dr-xr-xr-x 8 root root 0 Feb 23 12:50 info
While "info" is a read-only directory it has contained writable files
since the original monitoring support landed (max_threshold_occupancy)
and has gained more writable files since then.
>
> And its name suggests it's for info only as well.
>
> Instead of mixing info and control together, would it be better to add a new "control" or "config" directory in /sys/fs/resctrl for this control and info purpose?
While I agree "config" may be a more appropriate name I do not think we are
in a position to change it now. The documentation is clear here with there being
only two sections for resctrl files: "Info directory" and "Resource alloc and monitor groups".
>
> 2. This control method seems to handle only global control for resources. But what if a control is per domain and per closid/partid?
The intention of the files within info/<resource>/resource_schemata related to
controls are to describe the control *properties*, not for user space to set control
values using these files.
The values of the controls will continue to be set by user space via the per
closid/partid/resource group "schemata" file. The intention of the info/<resource>/resource_schemata
files is to describe to user space what are valid values for the "schemata" file and
the expectation is that these files (info/<resource>/resource_schemata/*) will
be (at least initially) read-only.
> For example, MPAM has a hardlimit control per mem bandwidth allocation domain per partid. When hardlimit is enabled, MPAM hardware enforces a hard limit of MBW max. This cannot be controlled globally.
>
> For this kind of per-partid, per-domain control, I propose a config_schemata/control_schemata file:
>
> partition X/
> control_schemata (or config_schemata):
> MB_hardlimit: 0=0/1;1=0/1;...
>
> Is this reasonable?
Yes, managing HARDLIM as an additional schema/control is reasonable.
Exactly how to expose its valid values to user space via info/ files has not
been discussed but I believe the schema description format does support such
extension.
Please see https://lore.kernel.org/lkml/aO0Oazuxt54hQFbx@e133380.arm.com/ for
some example schemata related to HARDLIM.
Reinette
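[Editorial aside: the control properties in the info/<resource>/resource_schemata
example tree above (min, max, resolution for a "scalar linear" control) suggest a
simple validity check user space could apply before writing a schemata value. A
hedged sketch; the assumption that valid values must be multiples of "resolution"
is mine, not stated in the thread.]

```python
def valid_value(props, value):
    """Check a requested bandwidth value against the control properties
    advertised in info/<resource>/resource_schemata/: it must lie in
    [min, max] and (assumed here) be a multiple of 'resolution' for a
    'scalar linear' control."""
    if not props["min"] <= value <= props["max"]:
        return False
    return value % props["resolution"] == 0

# Properties of the MB control from the example tree above:
mb = {"min": 1, "max": 8192, "resolution": 8}
```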
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-23 17:12 ` Reinette Chatre
@ 2026-02-23 22:35 ` Moger, Babu
2026-02-23 23:13 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Moger, Babu @ 2026-02-23 22:35 UTC (permalink / raw)
To: Reinette Chatre, Luck, Tony, Ben Horgan, Moger, Babu,
eranian@google.com
Cc: Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Reinette,
On 2/23/2026 11:12 AM, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/20/26 2:44 PM, Moger, Babu wrote:
>> On 2/19/2026 8:53 PM, Reinette Chatre wrote:
>
>>> Summary of considerations surrounding CLOSID/RMID (PARTID/PMG) assignment for kernel work
>>> =========================================================================================
>>>
>>> - PLZA currently only supports global assignment (only PLZA_EN of
>>> MSR_IA32_PQR_PLZA_ASSOC may differ on logical processors). Even so, current
>>> speculation is that RMID_EN=0 implies that user space RMID is used to monitor
>>> kernel work that could appear to user as "kernel mode" supporting multiple RMIDs.
>>> https://lore.kernel.org/lkml/abb049fa-3a3d-4601-9ae3-61eeb7fd8fcf@amd.com/
>>
>> Yes. RMID_EN=0 means don't use a separate RMID for PLZA.
>
> Thank you very much for confirming.
>
> ...
>
>>> How can resctrl support the requirements?
>>> =========================================
>>>
>>> New global resctrl fs files
>>> ===========================
>>> info/kernel_mode (always visible)
>>> info/kernel_mode_assignment (visibility and content depends on active setting in info/kernel_mode)
>>
>> Probably a good idea to drop "assign" for this work. We already have the mbm_assign mode and related work.
>
> hmmm ... I think "assign" is generic enough of a word that
> it cannot be claimed by a single feature.
>
>
>> info/kernel_mode_assoc or info/kernel_mode_association? Or we can wait and rename it appropriately later.
>
> yes, naming can be settled later.
Sure.
>
>>
>>>
>>> info/kernel_mode
>>> ================
>>> - Displays the currently active mode as well as the possible modes available to user
>>> space.
>>> - Single place where user can query "kernel mode" behavior and capabilities of the
>>> system.
>>> - Some possible values:
>>> - inherit_ctrl_and_mon <=== previously named "match_user", just renamed for consistency with other names
>>> When active, kernel and user space use the same CLOSID/RMID. The current status
>>> quo for x86.
>>> - global_assign_ctrl_inherit_mon
>>> When active, CLOSID/control group can be assigned for *all* (hence, "global")
>>> kernel work while all kernel work uses the same RMID as user space.
>>> Can only be supported on architectures where CLOSID and RMID are independent.
>>> An arch may support this in hardware (RMID_EN=0?) or this can be done by resctrl during
>>> context switch if the RMID is independent and the context-switch cost is
>>> considered "reasonable".
>>> This supports use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>>> for PLZA.
>>> - global_assign_ctrl_assign_mon
>>> When active the same resource group (CLOSID and RMID) can be assigned to
>>> *all* kernel work. This could be any group, including the default group.
>>> There may not be a use case for this but it could be useful as an intermediate
>>> step toward the mode that follows (more later).
>>> - per_group_assign_ctrl_assign_mon
>>> When active every resource group can be associated with another (or the same)
>>> resource group. This association maps the resource group for user space work
>>> to resource group for kernel work. This is similar to the "kernel_group" idea
>>> presented in:
>>> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
>>> This addresses use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>>> for MPAM.
>>
>> All these new names and related information will go in global structure.
>>
>> Something like this..
>>
>> struct kern_mode {
>>     enum assoc_mode mode;
>>     struct rdtgroup *k_rdtgrp;
>>     ...
>> };
>>
>> Not sure what other information will be required here. Will know once I started working on it.
>>
>> This structure will be updated based on user writes to "kernel_mode" and "kernel_mode_assignment".
>
> This looks to be a good start. I think keeping the rdtgroup association is good since
> it helps to easily display the name to user space while also providing access to the CLOSID
> and RMID that are assigned to the tasks.
> Placing them in their own structure instead of in plain globals also makes it easier to
> build on when some modes have different requirements with respect to rdtgroup management.
I am not clear on this comment. Can you please elaborate a little bit?
Thanks
Babu
> You may find that certain arrangements work better for supporting interactions with the
> task structure that are not clear at this time.
>
>
>>
>>
>>> - Additional values can be added as new requirements arise, for example "per_task"
>>> assignment. Connecting visibility of info/kernel_mode_assignment to mode in
>>> info/kernel_mode enables resctrl to later support additional modes that may require
>>> different configuration files, potentially per-resource group like the "tasks_kernel"
>>> (or perhaps rather "kernel_mode_tasks" to have consistent prefix for this feature)
>>> and "cpus_kernel" ("kernel_mode_cpus"?) discussed in these threads.
>>
>> So, the per-resource-group files "kernel_mode_tasks" and "kernel_mode_cpus" are not required right now. Correct?
>
> Correct. The way I see it the baseline implementation to support PLZA should be
> straightforward. We'll probably spend a bit extra time on the supporting documentation
> to pave the way for possible additions.
>
>>> User can view active and supported modes:
>>>
>>> # cat info/kernel_mode
>>> [inherit_ctrl_and_mon]
>>> global_assign_ctrl_inherit_mon
>>> global_assign_ctrl_assign_mon
>>>
>>> User can switch modes:
>>> # echo global_assign_ctrl_inherit_mon > kernel_mode
>>> # cat kernel_mode
>>> inherit_ctrl_and_mon
>>> [global_assign_ctrl_inherit_mon]
>>> global_assign_ctrl_assign_mon
>>>
>>>
>>> info/kernel_mode_assignment
>>> ===========================
>>> - Visibility depends on active mode in info/kernel_mode.
>>> - Content depends on active mode in info/kernel_mode
>>> - Syntax to identify resource groups can use the syntax created as part of earlier ABMC work
>>> that supports default group https://lore.kernel.org/lkml/cover.1737577229.git.babu.moger@amd.com/
>>> - Default CTRL_MON group and if relevant, the default MON group, can be the default
>>> assignment when user just changes the kernel_mode without setting the assignment.
>>>
>>> info/kernel_mode_assignment when mode is global_assign_ctrl_inherit_mon
>>> -----------------------------------------------------------------------
>>> - info/kernel_mode_assignment contains single value that is the name of the control group
>>> used for all kernel work.
>>> - CLOSID/PARTID used for kernel work is determined from the control group assigned
>>> - default value is default CTRL_MON group
>>> - no monitor group assignment, kernel work inherits user space RMID
>>> - syntax is
>>> <CTRL_MON group> with "/" meaning default.
>>>
>>> info/kernel_mode_assignment when mode is global_assign_ctrl_assign_mon
>>> -----------------------------------------------------------------------
>>> - info/kernel_mode_assignment contains single value that is the name of the resource group
>>> used for all kernel work.
>>> - Combined CLOSID/RMID or combined PARTID/PMG is set globally to be associated with all
>>> kernel work.
>>> - default value is default CTRL_MON group
>>> - syntax is
>>> <CTRL_MON group>/<MON group>/ with "//" meaning default control and default monitoring group.
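The two trailing-slash syntaxes above can be parsed along these lines (an illustrative Python sketch, not kernel code; an empty component denotes the default group):

```python
def parse_group_path(s):
    """Parse '<CTRL_MON group>/' or '<CTRL_MON group>/<MON group>/'.

    Returns a tuple of group names where '' means the default group,
    so '/' -> ('',) and '//' -> ('', '').
    """
    if not s.endswith("/"):
        raise ValueError("assignment must end with '/'")
    parts = s[:-1].split("/")
    if len(parts) not in (1, 2):
        raise ValueError("expected '<ctrl>/' or '<ctrl>/<mon>/'")
    return tuple(parts)
```

For example, "kernel/kernel_g2/" would name the kernel_g2 monitoring group inside the kernel control group, while "//" names the default control and monitoring groups.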
>>>
>>> info/kernel_mode_assignment when mode is per_group_assign_ctrl_assign_mon
>>> -------------------------------------------------------------------------
>>> - this presents the information proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
>>> within a single file for convenience and potential optimization when user space needs to make changes.
>>> Interface proposed in https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/ is also an option
>>> and as an alternative a per-resource group "kernel_group" can be made visible when user space enables
>>> this mode.
>>> - info/kernel_mode_assignment contains a mapping of every resource group to another resource group:
>>> <resource group for user space work>:<resource group for kernel work>
>>> - all resource groups must be present in first field of this file
>>> - Even though this is a "per group" setting, the expectation is that this will set the
>>> kernel work CLOSID/RMID for every task. This implies that writing to this file would need
>>> to take the tasklist_lock which, when held for too long, may impact other parts of the system.
>>> See https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
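The user-group-to-kernel-group mapping lines described above could be parsed as follows (illustrative Python sketch only; the colon separator works because group names cannot contain ':'):

```python
def parse_mapping_line(line):
    """Parse '<user group path>:<kernel group path>' into a pair.

    Group paths use the trailing-'/' syntax, e.g. 'g1/g1m1/:kernel//'
    maps user monitoring group g1/g1m1 to the 'kernel' control group
    with its default monitoring group.
    """
    user, sep, kernel = line.partition(":")
    if not sep:
        raise ValueError("expected '<user>:<kernel>'")
    return user, kernel
```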
>>
>> This mode is currently not supported in the AMD PLZA implementation. But we have to keep the options open for future MPAM enhancements. I am still learning about the MPAM requirements.
>>
>>>
>>> Scenarios supported
>>> ===================
>>>
>>> Default
>>> -------
>>> For x86 I understand kernel work and user work to be done with same CLOSID/RMID which
>>> implies that info/kernel_mode can always be visible and at least display:
>>> # cat info/kernel_mode
>>> [inherit_ctrl_and_mon]
>>>
>>> info/kernel_mode_assignment is not visible in this mode.
>>>
>>> I understand MPAM may have different defaults here so would like to understand better.
>>>
>>> Dedicated global allocations for kernel work, monitoring same for user space and kernel (PLZA)
>>> ----------------------------------------------------------------------------------------------
>>> Possible scenario with PLZA, not MPAM (see later):
>>> 1. Create group(s) to manage allocations associated with user space work
>>> and assign tasks/CPUs to these groups.
>>> 2. Create group to manage allocations associated with all kernel work.
>>> - For example,
>>> # mkdir /sys/fs/resctrl/unthrottled
>>> - No constraints from resctrl fs on interactions with files in this group. From resctrl
>>> fs perspective it is not "dedicated" to kernel work but just another resource group.
>>
>> That is correct. We don't need to treat the group specially for kernel_mode while creating it. However, some handling will be required when the kernel_mode group is deleted. We need to move the tasks/CPUs back to the default group and update the global kernel_mode structure.
>
> Good point, yes.
>
>
> ...
>
>>> Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
>>> ----------------------------------------------------------------------------------------------
>>> 1. User space creates resource and monitoring groups for user tasks:
>>> /sys/fs/resctrl <= User space default allocations
>>> /sys/fs/resctrl/g1 <= User space allocations g1
>>> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
>>> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
>>> /sys/fs/resctrl/g2 <= User space allocations g2
>>> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
>>> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>>>
>>> 2. User space creates resource and monitoring groups for kernel work (system has two PMG):
>>> /sys/fs/resctrl/kernel <= Kernel space allocations
>>> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
>>> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
>>> 3. Set kernel mode to per_group_assign_ctrl_assign_mon:
>>> # echo per_group_assign_ctrl_assign_mon > info/kernel_mode
>>> - info/kernel_mode_assignment becomes visible and contains
>>> # cat info/kernel_mode_assignment
>>> //://
>>> g1//://
>>> g1/g1m1/://
>>> g1/g1m2/://
>>> g2//://
>>> g2/g2m1/://
>>> g2/g2m2/://
>>> - An optimization here may be to have the change to per_group_assign_ctrl_assign_mon mode be implemented
>>> similar to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
>>> avoid keeping tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to default just for
>>> user space to likely change it.
>>> 4. Set groups to be used for kernel work:
>>> # echo '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
>>>
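The single-write form shown in step 4 enables the optimization discussed in this thread: all mappings can be applied with one pass over the task list rather than one pass per mapping. A small simulation (editor's sketch in Python; the task representation is hypothetical, not the actual resctrl data structures):

```python
def apply_assignments(mapping_text, tasks):
    """Apply all '<user>:<kernel>' mapping lines in one pass over tasks.

    'tasks' is a list of dicts whose 'group' key names the user group;
    a 'kernel_group' key is set according to the mapping. The single
    loop over tasks models holding tasklist_lock only once, no matter
    how many mappings the write contains.
    """
    mapping = {}
    for line in mapping_text.splitlines():
        user, _, kernel = line.partition(":")
        mapping[user] = kernel
    for t in tasks:  # one loop == one tasklist_lock hold
        if t["group"] in mapping:
            t["kernel_group"] = mapping[t["group"]]
    return tasks
```

With one file per CTRL_MON/MON group, each of the mappings would instead require its own write, and hence its own full traversal of the task list.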
>>
>> Currently, this is not supported in AMD's PLZA implimentation. But we need to keep this option open for MPAM.
>
> Right. I expect PLZA to at least support "global_assign_ctrl_inherit_mon" mode
> since that is the one we know somebody is waiting for. I am not actually sure about
> "global_assign_ctrl_assign_mon" for PLZA. It is the variant intended to be implemented
> by this RFC submission and does not seem difficult to implement but I have not really heard
> any requests around it. Please do correct me if I missed anything here.
>
>>
>>> The interfaces proposed aim to maintain compatibility with existing user space tools while
>>> adding support for all requirements expressed thus far in an efficient way. For an existing
>>> user space tool there is no change in meaning of any existing file and no existing known
>>> resource group files are made to disappear. There is a global configuration that lets user space
>>> manage allocations without needing to check and configure each control group. Even per-resource
>>> group allocations can be managed from user space with a single read/write to support
>>> making changes in the most efficient way.
>>>
>>> What do you think?
>>>
>>
>> I will start planning this work. Feel free to add more details.
>> I will have more questions as I start working on it.
>>
>> I will separate GMBA work from this work.
>>
>> Will send both series separately.
>>
>> Thanks for details and summary.
>>
>
> Thank you very much.
>
> Reinette
>
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-23 22:35 ` Moger, Babu
@ 2026-02-23 23:13 ` Reinette Chatre
2026-02-24 19:37 ` Babu Moger
0 siblings, 1 reply; 114+ messages in thread
From: Reinette Chatre @ 2026-02-23 23:13 UTC (permalink / raw)
To: Moger, Babu, Luck, Tony, Ben Horgan, Moger, Babu,
eranian@google.com
Cc: Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Babu,
On 2/23/26 2:35 PM, Moger, Babu wrote:
> On 2/23/2026 11:12 AM, Reinette Chatre wrote:
>> On 2/20/26 2:44 PM, Moger, Babu wrote:
>>> On 2/19/2026 8:53 PM, Reinette Chatre wrote:
>>>> info/kernel_mode
>>>> ================
>>>> - Displays the currently active mode as well as the possible modes available to user
>>>> space.
>>>> - Single place where user can query "kernel mode" behavior and capabilities of the
>>>> system.
>>>> - Some possible values:
>>>> - inherit_ctrl_and_mon <=== previously named "match_user", just renamed for consistency with other names
>>>> When active, kernel and user space use the same CLOSID/RMID. The current status
>>>> quo for x86.
>>>> - global_assign_ctrl_inherit_mon
>>>> When active, CLOSID/control group can be assigned for *all* (hence, "global")
>>>> kernel work while all kernel work uses same RMID as user space.
>>>> Can only be supported on architecture where CLOSID and RMID are independent.
>>>> An arch may support this in hardware (RMID_EN=0?) or this can be done by resctrl during
>>>> context switch if the RMID is independent and the context switch cost is
>>>> considered "reasonable".
>>>> This supports use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>>>> for PLZA.
>>>> - global_assign_ctrl_assign_mon
>>>> When active the same resource group (CLOSID and RMID) can be assigned to
>>>> *all* kernel work. This could be any group, including the default group.
>>>> There may not be a use case for this but it could be useful as an intermediate
>>>> step towards the mode that follows (more later).
>>>> - per_group_assign_ctrl_assign_mon
>>>> When active every resource group can be associated with another (or the same)
>>>> resource group. This association maps the resource group for user space work
>>>> to resource group for kernel work. This is similar to the "kernel_group" idea
>>>> presented in:
>>>> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
>>>> This addresses use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>>>> for MPAM.
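The three modes above differ only in which CLOSID/RMID pair a task uses once it enters the kernel. A sketch summarizing this (illustrative Python from the editor; the ID values are arbitrary placeholders, not real hardware assignments):

```python
def ids_on_kernel_entry(mode, task_ids, kernel_ids):
    """Return the (closid, rmid) used while a task runs in the kernel."""
    task_closid, task_rmid = task_ids
    kernel_closid, kernel_rmid = kernel_ids
    if mode == "inherit_ctrl_and_mon":            # status quo: no change
        return task_closid, task_rmid
    if mode == "global_assign_ctrl_inherit_mon":  # kernel CLOSID, user RMID
        return kernel_closid, task_rmid
    if mode == "global_assign_ctrl_assign_mon":   # fully assigned group
        return kernel_closid, kernel_rmid
    raise ValueError(f"unknown mode: {mode}")
```

The per_group_assign_ctrl_assign_mon case would be the same as the last branch except that kernel_ids would be looked up per user resource group rather than from a single global assignment.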
>>>
>>> All these new names and related information will go in global structure.
>>>
>>> Something like this..
>>>
>>> struct kern_mode {
>>> enum assoc_mode mode;
>>> struct rdtgroup *k_rdtgrp;
>>> ...
>>> };
>>>
>>> Not sure what other information will be required here. Will know once I start working on it.
>>>
>>> This structure will be updated based on what the user writes to "kernel_mode" and "kernel_mode_assignment".
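A toy model of the proposed global structure and how writes to the two files would update it (editor's Python sketch, purely illustrative; the reset-to-default behavior on a mode switch follows the earlier suggestion that the default group is the initial assignment):

```python
SUPPORTED_MODES = ("inherit_ctrl_and_mon",
                   "global_assign_ctrl_inherit_mon",
                   "global_assign_ctrl_assign_mon")

class KernMode:
    """Mirrors the proposed 'struct kern_mode' global state."""
    def __init__(self):
        self.mode = "inherit_ctrl_and_mon"
        self.group = "/"            # default CTRL_MON group

    def write_kernel_mode(self, new_mode):
        """Model a write to info/kernel_mode."""
        if new_mode not in SUPPORTED_MODES:
            raise ValueError(f"unsupported mode: {new_mode}")
        self.mode = new_mode
        self.group = "/"            # mode switch resets to default group

    def write_assignment(self, group):
        """Model a write to info/kernel_mode_assignment."""
        if self.mode == "inherit_ctrl_and_mon":
            raise PermissionError("no assignment file in inherit mode")
        self.group = group
```

The rejection of assignment writes in inherit_ctrl_and_mon mode models the assignment file simply not being visible in that mode.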
>>
>> This looks to be a good start. I think keeping the rdtgroup association is good since
>> it helps to easily display the name to user space while also providing access to the CLOSID
>> and RMID that is assigned to the tasks.
>> By placing them in their own structure instead of just globals it does make it easier to
>> build on when some modes have different requirements wrt rdtgroup management.
>
> I am not clear on this comment. Can you please elaborate a little bit?
I believe what you propose should suffice for the initial support for PLZA. I do not
see the PLZA enabling needing anything more complicated.
As I understand for MPAM support there needs to be more state to track which privilege level
tasks run at.
So, when just considering how MPAM may build on this: The PARTID/PMG to run at when in kernel mode
can be managed per group or per task. In either case I suspect that struct task_struct would need
to include the kernel mode PARTID/PMG to support setting the correct kernel mode PARTID/PMG during
context switching similar to what you coded up in this initial RFC. MPAM may choose to have struct
task_struct be the only place to keep all state about which PARTID/PMG to run when in kernel mode
but I suspect that may result in a lot of lock contention (user space could, for example, be able
to lock up the entire system with a loop reading info/kernel_mode_assignment) so MPAM may choose to
expand the struct kernel_mode introduced by PLZA to, (if kernel mode is managed per group) instead
of one struct rdtgroup * contain a mapping of every resource group to the resource group that should
be used for kernel mode work. This could be some staging/cache used between user space and all the
task structures to help manage the state.
I do not know what MPAM implementation may choose to do but as I see it your proposal
provides a good foundation to build on since it establishes a global place, struct kernel_mode,
where all such state can/should be stored instead of some unspecified group of global variables.
Reinette
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-23 16:38 ` Reinette Chatre
@ 2026-02-24 9:36 ` Ben Horgan
2026-02-24 16:13 ` Reinette Chatre
0 siblings, 1 reply; 114+ messages in thread
From: Ben Horgan @ 2026-02-24 9:36 UTC (permalink / raw)
To: Reinette Chatre, Luck, Tony, Moger, Babu, eranian@google.com
Cc: Moger, Babu, Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Reinette,
On 2/23/26 16:38, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/23/26 2:08 AM, Ben Horgan wrote:
>> On 2/20/26 02:53, Reinette Chatre wrote:
>
> ...
>
>>> Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
>>> ----------------------------------------------------------------------------------------------
>>> 1. User space creates resource and monitoring groups for user tasks:
>>> /sys/fs/resctrl <= User space default allocations
>>> /sys/fs/resctrl/g1 <= User space allocations g1
>>> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
>>> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
>>> /sys/fs/resctrl/g2 <= User space allocations g2
>>> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
>>> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>>>
>>> 2. User space creates resource and monitoring groups for kernel work (system has two PMG):
>>> /sys/fs/resctrl/kernel <= Kernel space allocations
>>> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
>>> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
>>> 3. Set kernel mode to per_group_assign_ctrl_assign_mon:
>>> # echo per_group_assign_ctrl_assign_mon > info/kernel_mode
>>> - info/kernel_mode_assignment becomes visible and contains
>>> # cat info/kernel_mode_assignment
>>> //://
>>> g1//://
>>> g1/g1m1/://
>>> g1/g1m2/://
>>> g2//://
>>> g2/g2m1/://
>>> g2/g2m2/://
>>> - An optimization here may be to have the change to per_group_assign_ctrl_assign_mon mode be implemented
>>> similar to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
>>> avoid keeping tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to default just for
>>> user space to likely change it.
>>> 4. Set groups to be used for kernel work:
>>> # echo '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
>>
>> Am I right in thinking that you want this in the info directory to avoid
>> adding files to the CTRL_MON/MON groups?
>
> I see this file as providing the same capability as you suggested in
> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/. The reason why I
> presented this as a single file is not because I am trying to avoid adding
> files to the CTRL_MON/MON groups but because I believe such interface enables
> resctrl to have more flexibility and support more scenarios for optimization.
>
> As you mentioned in your proposal the solution enables a single write to move
> a task. As I thought through what resctrl needs to do on such write I saw a lot
> of similarities with mongrp_reparent() that loops through all the tasks via
> for_each_process_thread() while holding tasklist_lock. Issues with mongrp_reparent()
> holding tasklist_lock for a long time are described in [1].
>
> While the single file does not avoid taking tasklist_lock it does give the user the
> ability to set kernel group for multiple user groups with a single write. When user space
> does so I believe it is possible for resctrl to have an optimization that takes tasklist_lock
> just once and makes changes to tasks belonging to all groups while looping through all tasks on
> the system just once. With files within the CTRL_MON/MON groups, setting the kernel group for
> multiple user groups will require multiple writes from user space, where each write requires
> looping through tasks while holding tasklist_lock during each loop. From what I learned
> from [1] something like this can be very disruptive to the rest of the system.
>
> In summary, I see having this single file as providing the same capability as the
> one-file-per-CTRL_MON/MON-group approach, since the user can choose to set the kernel group
> one user group at a time, but it also gives more flexibility to resctrl for optimization.
>
> Nothing is set in stone here. There is still flexibility in this proposal to support
> PARTID and PMG assignment with a single file in each CTRL_MON/MON group if we find that
> it has more benefits. resctrl can still expose a "per_group_assign_ctrl_assign_mon" mode
> but instead of making "info/kernel_mode_assignment" visible when it is enabled, the control files
> in CTRL_MON/MON groups are made visible ... even in this case resctrl could still add the single
> file later if deemed necessary at that time.
>
> Considering all this, do you think resctrl should rather start with a file in each
> CTRL_MON/MON group?
From what you say, it sounds like the optimization opportunities granted
by having a single file will be necessary with some usage patterns and
so I'd be happy to start with just the single
"info/kernel_mode_assignment" file. It does mean that you need to
consider more than the current CTRL_MON directory when reading or
writing configuration but I don't see any real problem there.
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
Thanks,
Ben
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-24 9:36 ` Ben Horgan
@ 2026-02-24 16:13 ` Reinette Chatre
0 siblings, 0 replies; 114+ messages in thread
From: Reinette Chatre @ 2026-02-24 16:13 UTC (permalink / raw)
To: Ben Horgan, Luck, Tony, Moger, Babu, eranian@google.com
Cc: Moger, Babu, Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Ben,
On 2/24/26 1:36 AM, Ben Horgan wrote:
> Hi Reinette,
>
> On 2/23/26 16:38, Reinette Chatre wrote:
>> Hi Ben,
>>
>> On 2/23/26 2:08 AM, Ben Horgan wrote:
>>> On 2/20/26 02:53, Reinette Chatre wrote:
>>
>> ...
>>
>>>> Dedicated global allocations for kernel work, monitoring same for user space and kernel (MPAM)
>>>> ----------------------------------------------------------------------------------------------
>>>> 1. User space creates resource and monitoring groups for user tasks:
>>>> /sys/fs/resctrl <= User space default allocations
>>>> /sys/fs/resctrl/g1 <= User space allocations g1
>>>> /sys/fs/resctrl/g1/mon_groups/g1m1 <= User space monitoring group g1m1
>>>> /sys/fs/resctrl/g1/mon_groups/g1m2 <= User space monitoring group g1m2
>>>> /sys/fs/resctrl/g2 <= User space allocations g2
>>>> /sys/fs/resctrl/g2/mon_groups/g2m1 <= User space monitoring group g2m1
>>>> /sys/fs/resctrl/g2/mon_groups/g2m2 <= User space monitoring group g2m2
>>>>
>>>> 2. User space creates resource and monitoring groups for kernel work (system has two PMG):
>>>> /sys/fs/resctrl/kernel <= Kernel space allocations
>>>> /sys/fs/resctrl/kernel/mon_data <= Kernel space monitoring for all of default and g1
>>>> /sys/fs/resctrl/kernel/mon_groups/kernel_g2 <= Kernel space monitoring for all of g2
>>>> 3. Set kernel mode to per_group_assign_ctrl_assign_mon:
>>>> # echo per_group_assign_ctrl_assign_mon > info/kernel_mode
>>>> - info/kernel_mode_assignment becomes visible and contains
>>>> # cat info/kernel_mode_assignment
>>>> //://
>>>> g1//://
>>>> g1/g1m1/://
>>>> g1/g1m2/://
>>>> g2//://
>>>> g2/g2m1/://
>>>> g2/g2m2/://
>>>> - An optimization here may be to have the change to per_group_assign_ctrl_assign_mon mode be implemented
>>>> similar to the change to global_assign_ctrl_assign_mon that initializes a global default. This can
>>>> avoid keeping tasklist_lock for a long time to set all tasks' kernel CLOSID/RMID to default just for
>>>> user space to likely change it.
>>>> 4. Set groups to be used for kernel work:
>>>> # echo '//:kernel//\ng1//:kernel//\ng1/g1m1/:kernel//\ng1/g1m2/:kernel//\ng2//:kernel/kernel_g2/\ng2/g2m1/:kernel/kernel_g2/\ng2/g2m2/:kernel/kernel_g2/\n' > info/kernel_mode_assignment
>>>
>>> Am I right in thinking that you want this in the info directory to avoid
>>> adding files to the CTRL_MON/MON groups?
>>
>> I see this file as providing the same capability as you suggested in
>> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/. The reason why I
>> presented this as a single file is not because I am trying to avoid adding
>> files to the CTRL_MON/MON groups but because I believe such interface enables
>> resctrl to have more flexibility and support more scenarios for optimization.
>>
>> As you mentioned in your proposal the solution enables a single write to move
>> a task. As I thought through what resctrl needs to do on such write I saw a lot
>> of similarities with mongrp_reparent() that loops through all the tasks via
>> for_each_process_thread() while holding tasklist_lock. Issues with mongrp_reparent()
>> holding tasklist_lock for a long time are described in [1].
>>
>> While the single file does not avoid taking tasklist_lock it does give the user the
>> ability to set kernel group for multiple user groups with a single write. When user space
>> does so I believe it is possible for resctrl to have an optimization that takes tasklist_lock
>> just once and makes changes to tasks belonging to all groups while looping through all tasks on
>> the system just once. With files within the CTRL_MON/MON groups, setting the kernel group for
>> multiple user groups will require multiple writes from user space, where each write requires
>> looping through tasks while holding tasklist_lock during each loop. From what I learned
>> from [1] something like this can be very disruptive to the rest of the system.
>>
>> In summary, I see having this single file as providing the same capability as the
>> one-file-per-CTRL_MON/MON-group approach, since the user can choose to set the kernel group
>> one user group at a time, but it also gives more flexibility to resctrl for optimization.
>>
>> Nothing is set in stone here. There is still flexibility in this proposal to support
>> PARTID and PMG assignment with a single file in each CTRL_MON/MON group if we find that
>> it has more benefits. resctrl can still expose a "per_group_assign_ctrl_assign_mon" mode
>> but instead of making "info/kernel_mode_assignment" visible when it is enabled, the control files
>> in CTRL_MON/MON groups are made visible ... even in this case resctrl could still add the single
>> file later if deemed necessary at that time.
>>
>> Considering all this, do you think resctrl should rather start with a file in each
>> CTRL_MON/MON group?
>
> From what you say, it sounds like the optimization opportunities granted
> by having a single file will be necessary with some usage patterns and
> so I'd be happy to start with just the single
> "info/kernel_mode_assignment" file. It does mean that you need to
> consider more than the current CTRL_MON directory when reading or
> writing configuration but I don't see any real problem there.
When reading the global file it will display all groups, yes. Writing configuration
needs only to modify the group(s) that need to be modified (similar to the schemata file).
Babu and I did speculate a bit on other interactions with "info/kernel_mode_assignment"
in https://lore.kernel.org/lkml/0645bba3-6121-41d4-b627-323faf1089b7@intel.com/ and
resctrl may need to adjust how a task's group membership is managed. resctrl could cache
some state or manage task membership entirely differently, like
what Peter proposed in https://lore.kernel.org/lkml/20240325172707.73966-1-peternewman@google.com/
If task group membership management becomes "cheap" then the resctrl interface can be
reconsidered.
Reinette
>
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/CALPaoCh0SbG1+VbbgcxjubE7Cc2Pb6QqhG3NH6X=WwsNfqNjtA@mail.gmail.com/
>
> Thanks,
>
> Ben
>
* Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling
2026-02-23 23:13 ` Reinette Chatre
@ 2026-02-24 19:37 ` Babu Moger
0 siblings, 0 replies; 114+ messages in thread
From: Babu Moger @ 2026-02-24 19:37 UTC (permalink / raw)
To: Reinette Chatre, Moger, Babu, Luck, Tony, Ben Horgan,
eranian@google.com
Cc: Drew Fustini, corbet@lwn.net, Dave.Martin@arm.com,
james.morse@arm.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, akpm@linux-foundation.org,
pawan.kumar.gupta@linux.intel.com, pmladek@suse.com,
feng.tang@linux.alibaba.com, kees@kernel.org, arnd@arndb.de,
fvdl@google.com, lirongqing@baidu.com, bhelgaas@google.com,
seanjc@google.com, xin@zytor.com, Shukla, Manali,
dapeng1.mi@linux.intel.com, chang.seok.bae@intel.com,
Limonciello, Mario, naveen@kernel.org, elena.reshetova@intel.com,
Lendacky, Thomas, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
peternewman@google.com, Shenoy, Gautham Ranjal
Hi Reinette,
On 2/23/26 17:13, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/23/26 2:35 PM, Moger, Babu wrote:
>> On 2/23/2026 11:12 AM, Reinette Chatre wrote:
>>> On 2/20/26 2:44 PM, Moger, Babu wrote:
>>>> On 2/19/2026 8:53 PM, Reinette Chatre wrote:
>>>>> info/kernel_mode
>>>>> ================
>>>>> - Displays the currently active mode as well as the possible modes available to user
>>>>> space.
>>>>> - Single place where user can query "kernel mode" behavior and capabilities of the
>>>>> system.
>>>>> - Some possible values:
>>>>> - inherit_ctrl_and_mon <=== previously named "match_user", just renamed for consistency with other names
>>>>> When active, kernel and user space use the same CLOSID/RMID. The current status
>>>>> quo for x86.
>>>>> - global_assign_ctrl_inherit_mon
>>>>> When active, CLOSID/control group can be assigned for *all* (hence, "global")
>>>>> kernel work while all kernel work uses same RMID as user space.
>>>>> Can only be supported on architecture where CLOSID and RMID are independent.
>>>>> An arch may support this in hardware (RMID_EN=0?) or this can be done by resctrl during
>>>>> context switch if the RMID is independent and the context switch cost is
>>>>> considered "reasonable".
>>>>> This supports use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>>>>> for PLZA.
>>>>> - global_assign_ctrl_assign_mon
>>>>> When active the same resource group (CLOSID and RMID) can be assigned to
>>>>> *all* kernel work. This could be any group, including the default group.
>>>>> There may not be a use case for this but it could be useful as an intermediate
>>>>> step towards the mode that follows (more later).
>>>>> - per_group_assign_ctrl_assign_mon
>>>>> When active every resource group can be associated with another (or the same)
>>>>> resource group. This association maps the resource group for user space work
>>>>> to resource group for kernel work. This is similar to the "kernel_group" idea
>>>>> presented in:
>>>>> https://lore.kernel.org/lkml/aYyxAPdTFejzsE42@e134344.arm.com/
>>>>> This addresses use case https://lore.kernel.org/lkml/CABPqkBSq=cgn-am4qorA_VN0vsbpbfDePSi7gubicpROB1=djw@mail.gmail.com/
>>>>> for MPAM.
>>>> All these new names and related information will go in global structure.
>>>>
>>>> Something like this..
>>>>
>>>> struct kern_mode {
>>>> enum assoc_mode mode;
>>>> struct rdtgroup *k_rdtgrp;
>>>> ...
>>>> };
>>>>
>>>> Not sure what other information will be required here. Will know once I start working on it.
>>>>
>>>> This structure will be updated based on what the user writes to "kernel_mode" and "kernel_mode_assignment".
>>> This looks to be a good start. I think keeping the rdtgroup association is good since
>>> it helps to easily display the name to user space while also providing access to the CLOSID
>>> and RMID that is assigned to the tasks.
>>> By placing them in their own structure instead of just globals it does make it easier to
>>> build on when some modes have different requirements wrt rdtgroup management.
>> I am not clear on this comment. Can you please elaborate a little bit?
> I believe what you propose should suffice for the initial support for PLZA. I do not
> see the PLZA enabling needing anything more complicated.
>
> As I understand for MPAM support there needs to be more state to track which privilege level
> tasks run at.
>
> So, when just considering how MPAM may build on this: the PARTID/PMG to use when in kernel mode
> can be managed per group or per task. In either case I suspect that struct task_struct would need
> to include the kernel mode PARTID/PMG to support setting the correct value during
> context switching, similar to what you coded up in this initial RFC. MPAM may choose to have struct
> task_struct be the only place that keeps state about which PARTID/PMG to use when in kernel mode,
> but I suspect that may result in a lot of lock contention (user space could, for example,
> lock up the entire system with a loop reading info/kernel_mode_assignment). So, if kernel mode is
> managed per group, MPAM may choose to expand the struct kernel_mode introduced by PLZA to contain,
> instead of one struct rdtgroup *, a mapping of every resource group to the resource group that
> should be used for kernel mode work. This could serve as a staging area/cache between user space
> and all the task structures to help manage the state.
>
> I do not know what MPAM implementation may choose to do but as I see it your proposal
> provides a good foundation to build on since it establishes a global place, struct kernel_mode,
> where all such state can/should be stored instead of some unspecified group of global variables.
>
Sounds good. Thanks for the clarification.
Thanks
Babu
Thread overview: 114+ messages (newest: 2026-02-24 19:37 UTC)
2026-01-21 21:12 [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Babu Moger
2026-01-21 21:12 ` [RFC PATCH 01/19] x86,fs/resctrl: Add support for Global Bandwidth Enforcement (GLBE) Babu Moger
2026-02-09 18:44 ` Reinette Chatre
2026-02-11 1:07 ` Moger, Babu
2026-02-11 16:54 ` Reinette Chatre
2026-02-11 21:18 ` Babu Moger
2026-02-12 3:51 ` Reinette Chatre
2026-02-12 19:09 ` Babu Moger
2026-02-13 0:05 ` Reinette Chatre
2026-02-13 1:51 ` Moger, Babu
2026-02-13 16:17 ` Reinette Chatre
2026-02-13 23:14 ` Moger, Babu
2026-02-14 0:01 ` Reinette Chatre
2026-02-16 16:05 ` Babu Moger
2026-02-20 10:07 ` Ben Horgan
2026-02-20 18:39 ` Reinette Chatre
2026-02-23 9:29 ` Ben Horgan
2026-02-21 0:12 ` Moger, Babu
2026-02-23 13:21 ` Fenghua Yu
2026-02-23 17:38 ` Reinette Chatre
2026-02-23 13:21 ` Fenghua Yu
2026-01-21 21:12 ` [RFC PATCH 02/19] x86,fs/resctrl: Add the resource for Global Memory Bandwidth Allocation Babu Moger
2026-01-21 21:12 ` [RFC PATCH 03/19] fs/resctrl: Add new interface max_bandwidth Babu Moger
2026-02-06 23:58 ` Reinette Chatre
2026-02-09 23:52 ` Moger, Babu
2026-01-21 21:12 ` [RFC PATCH 04/19] fs/resctrl: Add the documentation for Global Memory Bandwidth Allocation Babu Moger
2026-02-03 0:00 ` Luck, Tony
2026-02-03 16:38 ` Babu Moger
2026-02-09 16:32 ` Reinette Chatre
2026-02-10 19:44 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 05/19] x86,fs/resctrl: Add support for Global Slow Memory Bandwidth Allocation (GSMBA) Babu Moger
2026-01-21 21:12 ` [RFC PATCH 06/19] x86,fs/resctrl: Add the resource for Global Slow Memory Bandwidth Enforcement(GLSBE) Babu Moger
2026-01-21 21:12 ` [RFC PATCH 07/19] fs/resctrl: Add the documentation for Global Slow Memory Bandwidth Allocation Babu Moger
2026-01-21 21:12 ` [RFC PATCH 08/19] x86/resctrl: Support Privilege-Level Zero Association (PLZA) Babu Moger
2026-01-21 21:12 ` [RFC PATCH 09/19] x86/resctrl: Add plza_capable in rdt_resource data structure Babu Moger
2026-02-11 15:19 ` Ben Horgan
2026-02-11 16:54 ` Reinette Chatre
2026-02-11 17:48 ` Ben Horgan
2026-02-13 15:50 ` Moger, Babu
2026-01-21 21:12 ` [RFC PATCH 10/19] fs/resctrl: Expose plza_capable via control info file Babu Moger
2026-01-21 21:12 ` [RFC PATCH 11/19] resctrl: Introduce PLZA static key enable/disable helpers Babu Moger
2026-01-21 21:12 ` [RFC PATCH 12/19] x86/resctrl: Add data structures and definitions for PLZA configuration Babu Moger
2026-01-21 21:12 ` [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling Babu Moger
2026-01-27 22:30 ` Luck, Tony
2026-01-28 16:01 ` Moger, Babu
2026-01-28 17:12 ` Luck, Tony
2026-01-28 17:41 ` Moger, Babu
2026-01-28 17:44 ` Moger, Babu
2026-01-28 19:17 ` Luck, Tony
2026-02-10 16:17 ` Reinette Chatre
2026-02-10 18:04 ` Reinette Chatre
2026-02-11 16:40 ` Ben Horgan
2026-02-11 19:46 ` Luck, Tony
2026-02-11 22:22 ` Reinette Chatre
2026-02-12 13:55 ` Ben Horgan
2026-02-12 18:37 ` Reinette Chatre
2026-02-16 15:18 ` Ben Horgan
2026-02-17 18:51 ` Reinette Chatre
2026-02-17 21:44 ` Luck, Tony
2026-02-17 22:37 ` Reinette Chatre
2026-02-17 22:52 ` Luck, Tony
2026-02-17 23:55 ` Reinette Chatre
2026-02-18 16:44 ` Luck, Tony
2026-02-19 17:03 ` Luck, Tony
2026-02-19 17:45 ` Ben Horgan
2026-02-20 8:21 ` Drew Fustini
2026-02-19 17:33 ` Ben Horgan
2026-02-20 2:53 ` Reinette Chatre
2026-02-20 22:44 ` Moger, Babu
2026-02-23 17:12 ` Reinette Chatre
2026-02-23 22:35 ` Moger, Babu
2026-02-23 23:13 ` Reinette Chatre
2026-02-24 19:37 ` Babu Moger
2026-02-23 10:08 ` Ben Horgan
2026-02-23 16:38 ` Reinette Chatre
2026-02-24 9:36 ` Ben Horgan
2026-02-24 16:13 ` Reinette Chatre
2026-02-19 11:06 ` Ben Horgan
2026-02-19 18:12 ` Luck, Tony
2026-02-19 18:36 ` Reinette Chatre
2026-02-19 10:21 ` Ben Horgan
2026-02-19 18:14 ` Reinette Chatre
2026-02-23 9:48 ` Ben Horgan
2026-02-13 16:37 ` Moger, Babu
2026-02-13 17:02 ` Luck, Tony
2026-02-16 19:24 ` Babu Moger
2026-02-14 0:10 ` Reinette Chatre
2026-02-16 15:41 ` Ben Horgan
2026-02-16 22:52 ` Moger, Babu
2026-02-17 15:56 ` Ben Horgan
2026-02-17 16:38 ` Babu Moger
2026-02-18 9:54 ` Ben Horgan
2026-02-18 6:22 ` Stephane Eranian
2026-02-18 9:35 ` Ben Horgan
2026-02-19 10:27 ` Ben Horgan
2026-02-16 22:36 ` Moger, Babu
2026-02-12 10:00 ` Ben Horgan
2026-01-21 21:12 ` [RFC PATCH 14/19] x86,fs/resctrl: Add the functionality to configure PLZA Babu Moger
2026-01-29 19:13 ` Luck, Tony
2026-01-29 19:53 ` Babu Moger
2026-01-21 21:12 ` [RFC PATCH 15/19] fs/resctrl: Introduce PLZA attribute in rdtgroup interface Babu Moger
2026-01-21 21:12 ` [RFC PATCH 16/19] fs/resctrl: Implement rdtgroup_plza_write() to configure PLZA in a group Babu Moger
2026-01-28 22:03 ` Luck, Tony
2026-01-29 18:54 ` Luck, Tony
2026-01-29 19:31 ` Babu Moger
2026-01-29 19:42 ` Babu Moger
2026-02-10 0:05 ` Reinette Chatre
2026-02-11 23:10 ` Moger, Babu
2026-01-21 21:12 ` [RFC PATCH 17/19] fs/resctrl: Update PLZA configuration when cpu_mask changes Babu Moger
2026-01-21 21:12 ` [RFC PATCH 18/19] x86/resctrl: Refactor show_rdt_tasks() to support PLZA task matching Babu Moger
2026-01-21 21:12 ` [RFC PATCH 19/19] fs/resctrl: Add per-task PLZA enable support via rdtgroup Babu Moger
2026-02-03 19:58 ` [RFC PATCH 00/19] x86,fs/resctrl: Support for Global Bandwidth Enforcement and Priviledge Level Zero Association Luck, Tony
2026-02-10 0:27 ` Reinette Chatre
2026-02-11 0:40 ` Drew Fustini