[PATCH v8 0/7] x86/resctrl: mba

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement
@ 2024-10-29 17:28 Tony Luck
  2024-10-29 17:28 ` [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control Tony Luck
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

Add support to choose the memory monitor bandwidth event independently
for each ctrl_mon group when resctrl is mounted with the mba_MBps
option. Users may want this for applications that are not localized to
NUMA boundaries.  Default behavior still uses local memory bandwidth
when that event is supported by the platform.

Side benefit[0]: Systems that do not support the local bandwidth monitor
event but do support the total bandwidth event can now use the mba_MBps
mount option.

Changes since v7[1]:
--------------------

Almost a complete rewrite based on the new user ABI of a file
in each ctrl_mon group to select the event instead of a mount
option that applies to all groups.

Some of the code from the v7 patch0001 was salvaged and is now
split between patches 0002/0003 in this series. Patch 0002
addresses comments from Reinette[2] with additional sanity
checks, use of WARN_ON_ONCE() and early return from functions
where these checks fail.

I moved the refactor of mbm_update() to a separate patch to
make it easier to review the changes to compute bandwidth for
all memory bandwidth events.

Signed-off-by: Tony Luck <tony.luck@intel.com>

[0] My original objective!
[1] https://lore.kernel.org/all/20241003191228.67541-1-tony.luck@intel.com
[2] https://lore.kernel.org/all/bb30835f-5be9-44b4-8544-2f528e7fc573@intel.com/

Tony Luck (7):
  x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control
  x86/resctrl: Compute memory bandwidth for all supported events
  x86/resctrl: Refactor mbm_update()
  x86/resctrl: Relax checks for mba_MBps mount option
  x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories
  x86/resctrl: Add write option to "mba_MBps_event" file
  x86/resctrl: Document the new "mba_MBps_event" file

 Documentation/arch/x86/resctrl.rst        | 10 +++
 include/linux/resctrl.h                   |  2 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  6 ++
 arch/x86/kernel/cpu/resctrl/core.c        |  5 ++
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 71 ++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c     | 91 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 28 ++++++-
 7 files changed, 167 insertions(+), 46 deletions(-)

base-commit: 81983758430957d9a5cb3333fe324fd70cf63e7e
-- 
2.47.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control
  2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
@ 2024-10-29 17:28 ` Tony Luck
  2024-11-01 22:03   ` Fenghua Yu
  2024-11-12 19:24   ` Reinette Chatre
  2024-10-29 17:28 ` [PATCH v8 2/7] x86/resctrl: Compute memory bandwidth for all supported events Tony Luck
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

Resctrl uses local memory bandwidth event as input to the feedback
loop when the mba_MBps mount option is used. This means that this
mount option cannot be used on systems that only support monitoring
of total bandwidth.

Prepare to allow users to choose the input event independently for
each ctrl_mon group.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                | 2 ++
 arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
 arch/x86/kernel/cpu/resctrl/core.c     | 5 +++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++++++
 4 files changed, 15 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..fd05b937e2f4 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -49,6 +49,8 @@ enum resctrl_event_id {
 	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
 };
 
+extern enum resctrl_event_id mba_mbps_default_event;
+
 /**
  * struct resctrl_staged_config - parsed configuration to be applied
  * @new_ctrl:		new ctrl value to be loaded
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 955999aecfca..a6f051fb2e69 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -283,6 +283,7 @@ struct pseudo_lock_region {
  *				monitor only or ctrl_mon group
  * @mon:			mongroup related data
  * @mode:			mode of resource group
+ * @mba_mbps_event:		input event id when mba_sc mode is active
  * @plr:			pseudo-locked region
  */
 struct rdtgroup {
@@ -295,6 +296,7 @@ struct rdtgroup {
 	enum rdt_group_type		type;
 	struct mongroup			mon;
 	enum rdtgrp_mode		mode;
+	enum resctrl_event_id		mba_mbps_event;
 	struct pseudo_lock_region	*plr;
 };
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b681c2e07dbf..5b55a7ac7013 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -958,6 +958,11 @@ static __init bool get_rdt_mon_resources(void)
 	if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
 		rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
 
+	if (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID))
+		mba_mbps_default_event = QOS_L3_MBM_LOCAL_EVENT_ID;
+	else if (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID))
+		mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
+
 	if (!rdt_mon_features)
 		return false;
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d7163b764c62..dbfb9d11f3f8 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -65,6 +65,8 @@ static void rdtgroup_destroy_root(void);
 
 struct dentry *debugfs_resctrl;
 
+enum resctrl_event_id mba_mbps_default_event;
+
 static bool resctrl_debug;
 
 void rdt_last_cmd_clear(void)
@@ -2665,6 +2667,8 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (ret)
 		goto out_schemata_free;
 
+	rdtgroup_default.mba_mbps_event = mba_mbps_default_event;
+
 	kernfs_activate(rdtgroup_default.kn);
 
 	ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
@@ -3624,6 +3628,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
 		}
 	}
 
+	rdtgrp->mba_mbps_event = mba_mbps_default_event;
+
 	goto out_unlock;
 
 out_del_list:
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v8 2/7] x86/resctrl: Compute memory bandwidth for all supported events
  2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
  2024-10-29 17:28 ` [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control Tony Luck
@ 2024-10-29 17:28 ` Tony Luck
  2024-11-12 19:25   ` Reinette Chatre
  2024-10-29 17:28 ` [PATCH v8 3/7] x86/resctrl: Refactor mbm_update() Tony Luck
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

Computing the bandwidth for an event is cheap, and only done once
per second. Doing so simplifies switching between events and allows
choosing different events per ctrl_mon group.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 38 ++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 851b561850e0..3ef339e405c2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -663,9 +663,12 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
  */
 static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
 {
-	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
-	struct mbm_state *m = &rr->d->mbm_local[idx];
 	u64 cur_bw, bytes, cur_bytes;
+	struct mbm_state *m;
+
+	m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+	if (WARN_ON_ONCE(!m))
+		return;
 
 	cur_bytes = rr->val;
 	bytes = cur_bytes - m->prev_bw_bytes;
@@ -752,20 +755,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
 	struct rdt_ctrl_domain *dom_mba;
+	enum resctrl_event_id evt_id;
 	struct rdt_resource *r_mba;
-	u32 cur_bw, user_bw, idx;
 	struct list_head *head;
 	struct rdtgroup *entry;
+	u32 cur_bw, user_bw;
 
-	if (!is_mbm_local_enabled())
+	if (!is_mbm_enabled())
 		return;
 
 	r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
+	evt_id = rgrp->mba_mbps_event;
+
+	if (WARN_ON_ONCE(!is_mbm_event(evt_id)))
+		return;
+	if (WARN_ON_ONCE(evt_id == QOS_L3_MBM_LOCAL_EVENT_ID && !is_mbm_local_enabled()))
+		return;
+	if (WARN_ON_ONCE(evt_id == QOS_L3_MBM_TOTAL_EVENT_ID && !is_mbm_total_enabled()))
+		return;
+
 
 	closid = rgrp->closid;
 	rmid = rgrp->mon.rmid;
-	idx = resctrl_arch_rmid_idx_encode(closid, rmid);
-	pmbm_data = &dom_mbm->mbm_local[idx];
+	pmbm_data = get_mbm_state(dom_mbm, closid, rmid, evt_id);
+	if (WARN_ON_ONCE(!pmbm_data))
+		return;
 
 	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
@@ -784,7 +798,9 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 	 */
 	head = &rgrp->mon.crdtgrp_list;
 	list_for_each_entry(entry, head, mon.crdtgrp_list) {
-		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+		cmbm_data = get_mbm_state(dom_mbm, entry->closid, entry->mon.rmid, evt_id);
+		if (WARN_ON_ONCE(!cmbm_data))
+			return;
 		cur_bw += cmbm_data->prev_bw;
 	}
 
@@ -837,6 +853,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 		__mon_event_count(closid, rmid, &rr);
 
+		/*
+		 * Call the MBA software controller only for the
+		 * control groups and when user has enabled
+		 * the software controller explicitly.
+		 */
+		if (is_mba_sc(NULL))
+			mbm_bw_count(closid, rmid, &rr);
+
 		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
 	}
 	if (is_mbm_local_enabled()) {
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
  2024-10-29 17:28 ` [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control Tony Luck
  2024-10-29 17:28 ` [PATCH v8 2/7] x86/resctrl: Compute memory bandwidth for all supported events Tony Luck
@ 2024-10-29 17:28 ` Tony Luck
  2024-11-01 22:08   ` Fenghua Yu
  2024-11-13 22:25   ` Reinette Chatre
  2024-10-29 17:28 ` [PATCH v8 4/7] x86/resctrl: Relax checks for mba_MBps mount option Tony Luck
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

Computing memory bandwidth for all enabled events resulted in
identical code blocks for total and local bandwidth in mbm_update().

Refactor with a helper function to eliminate code duplication.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 69 ++++++++++-----------------
 1 file changed, 24 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3ef339e405c2..1b6cb3bbc008 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -829,62 +829,41 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
-		       u32 closid, u32 rmid)
+static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+				 u32 closid, u32 rmid, enum resctrl_event_id evtid)
 {
 	struct rmid_read rr = {0};
 
 	rr.r = r;
 	rr.d = d;
+	rr.evtid = evtid;
+	rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+	if (IS_ERR(rr.arch_mon_ctx)) {
+		pr_warn_ratelimited("Failed to allocate monitor context: %ld",
+				    PTR_ERR(rr.arch_mon_ctx));
+		return;
+	}
+
+	__mon_event_count(closid, rmid, &rr);
+
+	if (is_mba_sc(NULL))
+		mbm_bw_count(closid, rmid, &rr);
+
+	resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
+}
 
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
+		       u32 closid, u32 rmid)
+{
 	/*
 	 * This is protected from concurrent reads from user
 	 * as both the user and we hold the global mutex.
 	 */
-	if (is_mbm_total_enabled()) {
-		rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
-		rr.val = 0;
-		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
-		if (IS_ERR(rr.arch_mon_ctx)) {
-			pr_warn_ratelimited("Failed to allocate monitor context: %ld",
-					    PTR_ERR(rr.arch_mon_ctx));
-			return;
-		}
-
-		__mon_event_count(closid, rmid, &rr);
-
-		/*
-		 * Call the MBA software controller only for the
-		 * control groups and when user has enabled
-		 * the software controller explicitly.
-		 */
-		if (is_mba_sc(NULL))
-			mbm_bw_count(closid, rmid, &rr);
-
-		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
-	}
-	if (is_mbm_local_enabled()) {
-		rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
-		rr.val = 0;
-		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
-		if (IS_ERR(rr.arch_mon_ctx)) {
-			pr_warn_ratelimited("Failed to allocate monitor context: %ld",
-					    PTR_ERR(rr.arch_mon_ctx));
-			return;
-		}
-
-		__mon_event_count(closid, rmid, &rr);
-
-		/*
-		 * Call the MBA software controller only for the
-		 * control groups and when user has enabled
-		 * the software controller explicitly.
-		 */
-		if (is_mba_sc(NULL))
-			mbm_bw_count(closid, rmid, &rr);
+	if (is_mbm_total_enabled())
+		mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_TOTAL_EVENT_ID);
 
-		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
-	}
+	if (is_mbm_local_enabled())
+		mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_LOCAL_EVENT_ID);
 }
 
 /*
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v8 4/7] x86/resctrl: Relax checks for mba_MBps mount option
  2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
                   ` (2 preceding siblings ...)
  2024-10-29 17:28 ` [PATCH v8 3/7] x86/resctrl: Refactor mbm_update() Tony Luck
@ 2024-10-29 17:28 ` Tony Luck
  2024-10-29 17:28 ` [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories Tony Luck
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

This option may be used with any memory bandwidth monitoring event.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index dbfb9d11f3f8..5034a3dd0430 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2345,7 +2345,7 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
-	return (is_mbm_local_enabled() &&
+	return (is_mbm_enabled() &&
 		r->alloc_capable && is_mba_linear() &&
 		r->ctrl_scope == rmbm->mon_scope);
 }
@@ -2772,7 +2772,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->enable_cdpl2 = true;
 		return 0;
 	case Opt_mba_mbps:
-		msg = "mba_MBps requires local MBM and linear scale MBA at L3 scope";
+		msg = "mba_MBps requires MBM and linear scale MBA at L3 scope";
 		if (!supports_mba_mbps())
 			return invalfc(fc, msg);
 		ctx->enable_mba_mbps = true;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories
  2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
                   ` (3 preceding siblings ...)
  2024-10-29 17:28 ` [PATCH v8 4/7] x86/resctrl: Relax checks for mba_MBps mount option Tony Luck
@ 2024-10-29 17:28 ` Tony Luck
  2024-11-12 22:12   ` Reinette Chatre
  2024-10-29 17:28 ` [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file Tony Luck
  2024-10-29 17:28 ` [PATCH v8 7/7] x86/resctrl: Document the new " Tony Luck
  6 siblings, 1 reply; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

When the mba_MBps mount option is used, provide a file in each
ctrl_mon directory to show which memory monitoring event is
being used.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 ++
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 25 +++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 17 +++++++++++++++
 3 files changed, 44 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a6f051fb2e69..5f3438ca9e2b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -609,6 +609,8 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
+int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
+				 struct seq_file *s, void *v);
 bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 200d89a64027..b9ba419e5c88 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -518,6 +518,31 @@ static int smp_mon_event_count(void *arg)
 	return 0;
 }
 
+int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
+				 struct seq_file *s, void *v)
+{
+	struct rdtgroup *rdtgrp;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+
+	if (rdtgrp) {
+		switch (rdtgrp->mba_mbps_event) {
+		case QOS_L3_MBM_LOCAL_EVENT_ID:
+			seq_puts(s, "mbm_local_bytes\n");
+			break;
+		case QOS_L3_MBM_TOTAL_EVENT_ID:
+			seq_puts(s, "mbm_total_bytes\n");
+			break;
+		case QOS_L3_OCCUP_EVENT_ID:
+			break;
+		}
+	}
+
+	rdtgroup_kn_unlock(of->kn);
+
+	return 0;
+}
+
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    cpumask_t *cpumask, int evtid, int first)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5034a3dd0430..3ba81963e981 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1943,6 +1943,12 @@ static struct rftype res_common_files[] = {
 		.seq_show	= rdtgroup_schemata_show,
 		.fflags		= RFTYPE_CTRL_BASE,
 	},
+	{
+		.name		= "mba_MBps_event",
+		.mode		= 0644,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdtgroup_mba_mbps_event_show,
+	},
 	{
 		.name		= "mode",
 		.mode		= 0644,
@@ -2042,6 +2048,15 @@ void __init mbm_config_rftype_init(const char *config)
 		rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE;
 }
 
+static void mba_mbps_event_init(bool enable)
+{
+	struct rftype *rft;
+
+	rft = rdtgroup_get_rftype_by_name("mba_MBps_event");
+	if (rft)
+		rft->fflags = enable ? RFTYPE_CTRL_BASE : 0;
+}
+
 /**
  * rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file
  * @r: The resource group with which the file is associated.
@@ -2371,6 +2386,8 @@ static int set_mba_sc(bool mba_sc)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
 
+	mba_mbps_event_init(mba_sc);
+
 	return 0;
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
                   ` (4 preceding siblings ...)
  2024-10-29 17:28 ` [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories Tony Luck
@ 2024-10-29 17:28 ` Tony Luck
  2024-11-01 23:26   ` Fenghua Yu
  2024-11-12 22:18   ` Reinette Chatre
  2024-10-29 17:28 ` [PATCH v8 7/7] x86/resctrl: Document the new " Tony Luck
  6 siblings, 2 replies; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

A user can choose any of the memory bandwidth monitoring events
listed in /sys/fs/resctrl/info/L3_mon/mon_features independently
for each ctrl_mon group by writing to the "mba_MBps_event" file.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 +
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 46 +++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  1 +
 3 files changed, 49 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 5f3438ca9e2b..35483c6615b6 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -609,6 +609,8 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
+ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off);
 int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
 				 struct seq_file *s, void *v);
 bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b9ba419e5c88..fc5585dc688f 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -518,6 +518,52 @@ static int smp_mon_event_count(void *arg)
 	return 0;
 }
 
+ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off)
+{
+	struct rdtgroup *rdtgrp;
+	int ret = 0;
+
+	/* Valid input requires a trailing newline */
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+	buf[nbytes - 1] = '\0';
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp) {
+		rdtgroup_kn_unlock(of->kn);
+		return -ENOENT;
+	}
+	rdt_last_cmd_clear();
+
+	if (!strcmp(buf, "mbm_local_bytes")) {
+		if (is_mbm_local_enabled())
+			rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
+		else
+			ret = -ENXIO;
+	} else if (!strcmp(buf, "mbm_total_bytes")) {
+		if (is_mbm_total_enabled())
+			rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
+		else
+			ret = -ENXIO;
+	} else {
+		ret = -EINVAL;
+	}
+
+	switch (ret) {
+	case -ENXIO:
+		rdt_last_cmd_printf("Unsupported event id '%s'\n", buf);
+		break;
+	case -EINVAL:
+		rdt_last_cmd_printf("Unknown event id '%s'\n", buf);
+		break;
+	}
+
+	rdtgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
 				 struct seq_file *s, void *v)
 {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 3ba81963e981..6fa501ef187f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1947,6 +1947,7 @@ static struct rftype res_common_files[] = {
 		.name		= "mba_MBps_event",
 		.mode		= 0644,
 		.kf_ops		= &rdtgroup_kf_single_ops,
+		.write		= rdtgroup_mba_mbps_event_write,
 		.seq_show	= rdtgroup_mba_mbps_event_show,
 	},
 	{
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v8 7/7] x86/resctrl: Document the new "mba_MBps_event" file
  2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
                   ` (5 preceding siblings ...)
  2024-10-29 17:28 ` [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file Tony Luck
@ 2024-10-29 17:28 ` Tony Luck
  2024-11-12 22:25   ` Reinette Chatre
  6 siblings, 1 reply; 32+ messages in thread
From: Tony Luck @ 2024-10-29 17:28 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches,
	Tony Luck

New read/write file to show/set the memory bandwidth event used
to control bandwidth used by each ctrl_mon group.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/arch/x86/resctrl.rst | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a824affd741d..d86081e76bbf 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -384,6 +384,16 @@ When monitoring is enabled all MON groups will also contain:
 	Available only with debug option. The identifier used by hardware
 	for the monitor group. On x86 this is the RMID.
 
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+	Reading this file shows which memory bandwidth event is used
+	as input to a software feedback loop to keep memory bandwidth
+	below the value specified in the schemata file. Writing the
+	name of one of the supported memory bandwidth events found in
+	/sys/fs/resctrl/info/L3_MON/monfeatures will change the input
+	event.
+
 Resource allocation rules
 -------------------------
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control
  2024-10-29 17:28 ` [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control Tony Luck
@ 2024-11-01 22:03   ` Fenghua Yu
  2024-11-01 22:40     ` Tony Luck
  2024-11-12 19:24   ` Reinette Chatre
  1 sibling, 1 reply; 32+ messages in thread
From: Fenghua Yu @ 2024-11-01 22:03 UTC (permalink / raw)
  To: Tony Luck, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi, Tony,

On 10/29/24 10:28, Tony Luck wrote:
> Resctrl uses local memory bandwidth event as input to the feedback
> loop when the mba_MBps mount option is used. This means that this
> mount option cannot be used on systems that only support monitoring
> of total bandwidth.
> 
> Prepare to allow users to choose the input event independently for
> each ctrl_mon group.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   include/linux/resctrl.h                | 2 ++
>   arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
>   arch/x86/kernel/cpu/resctrl/core.c     | 5 +++++
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++++++
>   4 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index d94abba1c716..fd05b937e2f4 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -49,6 +49,8 @@ enum resctrl_event_id {
>   	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
>   };
>   
> +extern enum resctrl_event_id mba_mbps_default_event;
> +
>   /**
>    * struct resctrl_staged_config - parsed configuration to be applied
>    * @new_ctrl:		new ctrl value to be loaded
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 955999aecfca..a6f051fb2e69 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -283,6 +283,7 @@ struct pseudo_lock_region {
>    *				monitor only or ctrl_mon group
>    * @mon:			mongroup related data
>    * @mode:			mode of resource group
> + * @mba_mbps_event:		input event id when mba_sc mode is active
>    * @plr:			pseudo-locked region
>    */
>   struct rdtgroup {
> @@ -295,6 +296,7 @@ struct rdtgroup {
>   	enum rdt_group_type		type;
>   	struct mongroup			mon;
>   	enum rdtgrp_mode		mode;
> +	enum resctrl_event_id		mba_mbps_event;
>   	struct pseudo_lock_region	*plr;
>   };
>   
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b681c2e07dbf..5b55a7ac7013 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -958,6 +958,11 @@ static __init bool get_rdt_mon_resources(void)
>   	if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
>   		rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
>   
> +	if (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID))

Please change this check to:

	if (is_mbm_local_enabled())

> +		mba_mbps_default_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> +	else if (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID))

and this check to:

	else if (is_mbm_total_enabled())

> +		mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> +
>   	if (!rdt_mon_features)
>   		return false;
>   
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index d7163b764c62..dbfb9d11f3f8 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -65,6 +65,8 @@ static void rdtgroup_destroy_root(void);
>   
>   struct dentry *debugfs_resctrl;
>   
> +enum resctrl_event_id mba_mbps_default_event;
> +
>   static bool resctrl_debug;
>   
>   void rdt_last_cmd_clear(void)
> @@ -2665,6 +2667,8 @@ static int rdt_get_tree(struct fs_context *fc)
>   	if (ret)
>   		goto out_schemata_free;
>   
> +	rdtgroup_default.mba_mbps_event = mba_mbps_default_event;
> +
>   	kernfs_activate(rdtgroup_default.kn);
>   
>   	ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
> @@ -3624,6 +3628,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>   		}
>   	}
>   
> +	rdtgrp->mba_mbps_event = mba_mbps_default_event;
> +
>   	goto out_unlock;
>   
>   out_del_list:

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-10-29 17:28 ` [PATCH v8 3/7] x86/resctrl: Refactor mbm_update() Tony Luck
@ 2024-11-01 22:08   ` Fenghua Yu
  2024-11-01 22:57     ` Tony Luck
  2024-11-13 22:25   ` Reinette Chatre
  1 sibling, 1 reply; 32+ messages in thread
From: Fenghua Yu @ 2024-11-01 22:08 UTC (permalink / raw)
  To: Tony Luck, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi, Tony,

On 10/29/24 10:28, Tony Luck wrote:
> Computing memory bandwidth for all enabled events resulted in
> identical code blocks for total and local bandwidth in mbm_update().
> 
> Refactor with a helper function to eliminate code duplication.
> 
> No functional change.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/monitor.c | 69 ++++++++++-----------------
>   1 file changed, 24 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3ef339e405c2..1b6cb3bbc008 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -829,62 +829,41 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
>   	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
>   }
>   
> -static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
> -		       u32 closid, u32 rmid)
> +static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
> +				 u32 closid, u32 rmid, enum resctrl_event_id evtid)
>   {
>   	struct rmid_read rr = {0};
>   
>   	rr.r = r;
>   	rr.d = d;
> +	rr.evtid = evtid;
> +	rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
> +	if (IS_ERR(rr.arch_mon_ctx)) {
> +		pr_warn_ratelimited("Failed to allocate monitor context: %ld",
> +				    PTR_ERR(rr.arch_mon_ctx));
> +		return;
> +	}
> +
> +	__mon_event_count(closid, rmid, &rr);
> +

The comment (added in patch 2 and legacy code) is removed completely here.

Maybe it's better to remain the comment here?

> +	if (is_mba_sc(NULL))
> +		mbm_bw_count(closid, rmid, &rr);
> +
> +	resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
> +}
>   
> +static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
> +		       u32 closid, u32 rmid)
> +{
>   	/*
>   	 * This is protected from concurrent reads from user
>   	 * as both the user and we hold the global mutex.
>   	 */
> -	if (is_mbm_total_enabled()) {
> -		rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> -		rr.val = 0;
> -		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
> -		if (IS_ERR(rr.arch_mon_ctx)) {
> -			pr_warn_ratelimited("Failed to allocate monitor context: %ld",
> -					    PTR_ERR(rr.arch_mon_ctx));
> -			return;
> -		}
> -
> -		__mon_event_count(closid, rmid, &rr);
> -
> -		/*
> -		 * Call the MBA software controller only for the
> -		 * control groups and when user has enabled
> -		 * the software controller explicitly.
> -		 */

Same comment was added in patch 2 but is removed in the helper. Maybe 
it's better to add back the comment in the helper.

> -		if (is_mba_sc(NULL))
> -			mbm_bw_count(closid, rmid, &rr);
> -
> -		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
> -	}
> -	if (is_mbm_local_enabled()) {
> -		rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> -		rr.val = 0;
> -		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
> -		if (IS_ERR(rr.arch_mon_ctx)) {
> -			pr_warn_ratelimited("Failed to allocate monitor context: %ld",
> -					    PTR_ERR(rr.arch_mon_ctx));
> -			return;
> -		}
> -
> -		__mon_event_count(closid, rmid, &rr);
> -
> -		/*
> -		 * Call the MBA software controller only for the
> -		 * control groups and when user has enabled
> -		 * the software controller explicitly.
> -		 */

This same comment is in legacy code and is removed in the helper.

> -		if (is_mba_sc(NULL))
> -			mbm_bw_count(closid, rmid, &rr);
> +	if (is_mbm_total_enabled())
> +		mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_TOTAL_EVENT_ID);
>   
> -		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
> -	}
> +	if (is_mbm_local_enabled())
> +		mbm_update_one_event(r, d, closid, rmid, QOS_L3_MBM_LOCAL_EVENT_ID);
>   }
>   
>   /*

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control
  2024-11-01 22:03   ` Fenghua Yu
@ 2024-11-01 22:40     ` Tony Luck
  0 siblings, 0 replies; 32+ messages in thread
From: Tony Luck @ 2024-11-01 22:40 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Reinette Chatre, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

On Fri, Nov 01, 2024 at 03:03:32PM -0700, Fenghua Yu wrote:
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index b681c2e07dbf..5b55a7ac7013 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -958,6 +958,11 @@ static __init bool get_rdt_mon_resources(void)
> >   	if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
> >   		rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
> > +	if (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID))
> 
> Please change this check to:
> 
> 	if (is_mbm_local_enabled())
> 
> > +		mba_mbps_default_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> > +	else if (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID))
> 
> and this check to:
> 
> 	else if (is_mbm_total_enabled())
> 
> > +		mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> > +

You are correct. I had a moment of amnesia and forgot those
helpers existed and just pasted the expressions used to set
each of the bits in rdt_mon_features.

I will change as you suggest.

-Tony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-11-01 22:08   ` Fenghua Yu
@ 2024-11-01 22:57     ` Tony Luck
  0 siblings, 0 replies; 32+ messages in thread
From: Tony Luck @ 2024-11-01 22:57 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Reinette Chatre, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

On Fri, Nov 01, 2024 at 03:08:27PM -0700, Fenghua Yu wrote:
> > -		/*
> > -		 * Call the MBA software controller only for the
> > -		 * control groups and when user has enabled
> > -		 * the software controller explicitly.
> > -		 */
> 
> Same comment was added in patch 2 but is removed in the helper. Maybe it's
> better to add back the comment in the helper.
> 
> > -		if (is_mba_sc(NULL))
> > -			mbm_bw_count(closid, rmid, &rr);

In patch 2 I cut & pasted the comment along with the code from
the if (is_mbm_local()) {...}  clause into the if (is_mbm_total()) { ... }
clause. Maybe to make it more obvious that the code was duplicated.

But I had doubts about the usefulness/accuracy when making the
helper function.

Breaking the comment into pieces:

    "Call the MBA software controller"

This code isn't calling the "controller". To me that's the
update_mba_bw() function that adjusts the MBA values. The
call to mbm_bw_count() is simply computing the bandwidth.

    "only for the control groups"

while this is true, the test in the code here "if (is_mba_sc(NULL))"
isn't making any checks about control groups.

    "and when user has enabled the software controller"

Accurate.

    "explicitly"

Redundant (or confusing, there is no "implicit" way to enable
the s/w controller).

If a comment is needed (and I'm not convinced that it is)
maybe something like:

	/*
	 * If the software controller controller is enabled, compute the
	 * bandwidth for this event id.
	 */

-Tony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-10-29 17:28 ` [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file Tony Luck
@ 2024-11-01 23:26   ` Fenghua Yu
  2024-11-01 23:55     ` Luck, Tony
  2024-11-12 22:18   ` Reinette Chatre
  1 sibling, 1 reply; 32+ messages in thread
From: Fenghua Yu @ 2024-11-01 23:26 UTC (permalink / raw)
  To: Tony Luck, Reinette Chatre, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi, Tony,

On 10/29/24 10:28, Tony Luck wrote:
> A user can choose any of the memory bandwidth monitoring events
> listed in /sys/fs/resctrl/info/L3_mon/mon_features independently
> for each ctrl_mon group by writing to the "mba_MBps_event" file.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/internal.h    |  2 +
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 46 +++++++++++++++++++++++
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  1 +
>   3 files changed, 49 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 5f3438ca9e2b..35483c6615b6 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -609,6 +609,8 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>   				char *buf, size_t nbytes, loff_t off);
>   int rdtgroup_schemata_show(struct kernfs_open_file *of,
>   			   struct seq_file *s, void *v);
> +ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
> +				      char *buf, size_t nbytes, loff_t off);
>   int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
>   				 struct seq_file *s, void *v);
>   bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index b9ba419e5c88..fc5585dc688f 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -518,6 +518,52 @@ static int smp_mon_event_count(void *arg)
>   	return 0;
>   }
>   
> +ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
> +				      char *buf, size_t nbytes, loff_t off)
> +{
> +	struct rdtgroup *rdtgrp;
> +	int ret = 0;
> +
> +	/* Valid input requires a trailing newline */
> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
> +		return -EINVAL;
> +	buf[nbytes - 1] = '\0';
> +
> +	rdtgrp = rdtgroup_kn_lock_live(of->kn);
> +	if (!rdtgrp) {
> +		rdtgroup_kn_unlock(of->kn);
> +		return -ENOENT;
> +	}
> +	rdt_last_cmd_clear();
> +
> +	if (!strcmp(buf, "mbm_local_bytes")) {
> +		if (is_mbm_local_enabled())
> +			rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> +		else
> +			ret = -ENXIO;
> +	} else if (!strcmp(buf, "mbm_total_bytes")) {
> +		if (is_mbm_total_enabled())
> +			rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;


User may think each time toggling the local/total event will effect MBA. 
And they may create usage case like frequently changing the events to 
maintain/adjust both total and local within bw boundary.

But toggling mba_mbps_event faster than 1sec doesn't have any effect on 
MBA SC because MBA SC is called every one second.

Maybe need to add a ratelimit of 1 second on calling this function? And 
adding info in the document that toggling speed should be slower than 1 
second?

> +		else
> +			ret = -ENXIO;
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	switch (ret) {
> +	case -ENXIO:
> +		rdt_last_cmd_printf("Unsupported event id '%s'\n", buf);
> +		break;
> +	case -EINVAL:
> +		rdt_last_cmd_printf("Unknown event id '%s'\n", buf);
> +		break;
> +	}
> +
> +	rdtgroup_kn_unlock(of->kn);
> +
> +	return ret ?: nbytes;
> +}
> +
>   int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
>   				 struct seq_file *s, void *v)
>   {
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 3ba81963e981..6fa501ef187f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1947,6 +1947,7 @@ static struct rftype res_common_files[] = {
>   		.name		= "mba_MBps_event",
>   		.mode		= 0644,
>   		.kf_ops		= &rdtgroup_kf_single_ops,
> +		.write		= rdtgroup_mba_mbps_event_write,
>   		.seq_show	= rdtgroup_mba_mbps_event_show,
>   	},
>   	{

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-11-01 23:26   ` Fenghua Yu
@ 2024-11-01 23:55     ` Luck, Tony
  2024-11-02  0:57       ` Fenghua Yu
  0 siblings, 1 reply; 32+ messages in thread
From: Luck, Tony @ 2024-11-01 23:55 UTC (permalink / raw)
  To: Yu, Fenghua, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

> > +   if (!strcmp(buf, "mbm_local_bytes")) {
> > +           if (is_mbm_local_enabled())
> > +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> > +           else
> > +                   ret = -ENXIO;
> > +   } else if (!strcmp(buf, "mbm_total_bytes")) {
> > +           if (is_mbm_total_enabled())
> > +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
>
>
> User may think each time toggling the local/total event will effect MBA.
> And they may create usage case like frequently changing the events to
> maintain/adjust both total and local within bw boundary.
>
> But toggling mba_mbps_event faster than 1sec doesn't have any effect on
> MBA SC because MBA SC is called every one second.
>
> Maybe need to add a ratelimit of 1 second on calling this function? And
> adding info in the document that toggling speed should be slower than 1
> second?

The limit would need to be per ctrl_mon group, not on calls to this function.
It's perfectly ok to switch multiple groups in a short interval.

I'm not sure how to rate limit here. I could add a delay so that the write()
call blocks until enough time passes before making the change. But
what should I do if a user submits more writes to the file? Queue them
all and apply at one second intervals?

Maybe it would be better to just to add some additional text to the
documentation pointing out that resctrl only checks bandwidth once
per second to make throttling adjustments. So changes to the event
will only have effect after some seconds have passed?

-Tony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-11-01 23:55     ` Luck, Tony
@ 2024-11-02  0:57       ` Fenghua Yu
  2024-11-12 22:00         ` Reinette Chatre
  0 siblings, 1 reply; 32+ messages in thread
From: Fenghua Yu @ 2024-11-02  0:57 UTC (permalink / raw)
  To: Luck, Tony, Chatre, Reinette, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

Hi, Tony,

On 11/1/24 16:55, Luck, Tony wrote:
>>> +   if (!strcmp(buf, "mbm_local_bytes")) {
>>> +           if (is_mbm_local_enabled())
>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
>>> +           else
>>> +                   ret = -ENXIO;
>>> +   } else if (!strcmp(buf, "mbm_total_bytes")) {
>>> +           if (is_mbm_total_enabled())
>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
>>
>>
>> User may think each time toggling the local/total event will effect MBA.
>> And they may create usage case like frequently changing the events to
>> maintain/adjust both total and local within bw boundary.
>>
>> But toggling mba_mbps_event faster than 1sec doesn't have any effect on
>> MBA SC because MBA SC is called every one second.
>>
>> Maybe need to add a ratelimit of 1 second on calling this function? And
>> adding info in the document that toggling speed should be slower than 1
>> second?
> 
> The limit would need to be per ctrl_mon group, not on calls to this function.
> It's perfectly ok to switch multiple groups in a short interval.

Agree.

> 
> I'm not sure how to rate limit here. I could add a delay so that the write()
> call blocks until enough time passes before making the change. But
> what should I do if a user submits more writes to the file? Queue them
> all and apply at one second intervals?

Maybe define "mba_mbps_last_time" in rdtgroup. Then

if (time_before(jiffies, rdtgrp->mba_mbps_last_time + HZ) {
	rdt_last_cmd_printf("Too fast (>1/s) mba_MBps event change)\n");
         rdtgroup_kn_unlock(of->kn);
	return -EAGAIN;
}
rdtgrp->mba_mbps_last_time = jiffies;

> 
> Maybe it would be better to just to add some additional text to the
> documentation pointing out that resctrl only checks bandwidth once
> per second to make throttling adjustments. So changes to the event
> will only have effect after some seconds have passed?


Add additional text would be great.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control
  2024-10-29 17:28 ` [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control Tony Luck
  2024-11-01 22:03   ` Fenghua Yu
@ 2024-11-12 19:24   ` Reinette Chatre
  1 sibling, 0 replies; 32+ messages in thread
From: Reinette Chatre @ 2024-11-12 19:24 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On 10/29/24 10:28 AM, Tony Luck wrote:
> Resctrl uses local memory bandwidth event as input to the feedback
> loop when the mba_MBps mount option is used. This means that this
> mount option cannot be used on systems that only support monitoring
> of total bandwidth.
> 
> Prepare to allow users to choose the input event independently for
> each ctrl_mon group.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h                | 2 ++
>  arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
>  arch/x86/kernel/cpu/resctrl/core.c     | 5 +++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++++++
>  4 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index d94abba1c716..fd05b937e2f4 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -49,6 +49,8 @@ enum resctrl_event_id {
>  	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
>  };
>  
> +extern enum resctrl_event_id mba_mbps_default_event;
> +
>  /**
>   * struct resctrl_staged_config - parsed configuration to be applied
>   * @new_ctrl:		new ctrl value to be loaded
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 955999aecfca..a6f051fb2e69 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -283,6 +283,7 @@ struct pseudo_lock_region {
>   *				monitor only or ctrl_mon group
>   * @mon:			mongroup related data
>   * @mode:			mode of resource group
> + * @mba_mbps_event:		input event id when mba_sc mode is active

Referring to mba_sc as a "mode" is potentially confusing when considering the
line right above it (specifically, mba_sc is not a possible value of @mode).
How about "input monitoring event id when MBA software controller (mba_sc) is enabled"
or "input monitoring event id when mba_sc is enabled"?

>   * @plr:			pseudo-locked region
>   */
>  struct rdtgroup {
> @@ -295,6 +296,7 @@ struct rdtgroup {
>  	enum rdt_group_type		type;
>  	struct mongroup			mon;
>  	enum rdtgrp_mode		mode;
> +	enum resctrl_event_id		mba_mbps_event;
>  	struct pseudo_lock_region	*plr;
>  };
>  
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b681c2e07dbf..5b55a7ac7013 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -958,6 +958,11 @@ static __init bool get_rdt_mon_resources(void)
>  	if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
>  		rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
>  
> +	if (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID))
> +		mba_mbps_default_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> +	else if (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID))
> +		mba_mbps_default_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> +
>  	if (!rdt_mon_features)
>  		return false;

Shouldn't checking of individual monitoring features be at this point? That is,
after confirming that there are indeed monitoring features?

fyi ... since these checks are not architectural I expect that they will
move to resctrl_mon_resource_init() as part of the resctrl fs/arch split.
https://lore.kernel.org/all/20241004180347.19985-17-james.morse@arm.com/
This move will be supported when using the helpers as Fenghua suggested.

>  
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index d7163b764c62..dbfb9d11f3f8 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -65,6 +65,8 @@ static void rdtgroup_destroy_root(void);
>  
>  struct dentry *debugfs_resctrl;
>  
> +enum resctrl_event_id mba_mbps_default_event;
> +
>  static bool resctrl_debug;
>  
>  void rdt_last_cmd_clear(void)
> @@ -2665,6 +2667,8 @@ static int rdt_get_tree(struct fs_context *fc)
>  	if (ret)
>  		goto out_schemata_free;
>  
> +	rdtgroup_default.mba_mbps_event = mba_mbps_default_event;
> +

The pattern of the default group properties are to initialize them once
in rdtgroup_setup_default() and then do any needed reset in rdt_kill_sb().
You can compare with how rdtgroup_default.mode is handled.
If existing pattern is not acceptable then it could be reworked to have one pattern
support all scenarios instead of fragmenting the default group's initialization and
reset.

I do notice that the current MPAM proposal I linked above calls 
rdtgroup_setup_default() before resctrl_mon_resource_init() so I think that
needs to be swapped to support this feature.

>  	kernfs_activate(rdtgroup_default.kn);
>  
>  	ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
> @@ -3624,6 +3628,8 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
>  		}
>  	}
>  
> +	rdtgrp->mba_mbps_event = mba_mbps_default_event;
> +

Should this be in the "if (resctrl_arch_mon_capable())" section? Looking at rdtgroup_default
I also do not see any check for whether monitoring is supported. Is this intended? 
mba_mbps_default_event is only initialized if monitoring is supported so it *seems* as though
the value of 0, which is not a valid value of enum resctrl_event_id is intended as an
"uninitialized"? Unclear how to interpret this implementation at this point. Reading
the changelog again there are no details about these subtleties.

>  	goto out_unlock;
>  
>  out_del_list:

Reinette

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 2/7] x86/resctrl: Compute memory bandwidth for all supported events
  2024-10-29 17:28 ` [PATCH v8 2/7] x86/resctrl: Compute memory bandwidth for all supported events Tony Luck
@ 2024-11-12 19:25   ` Reinette Chatre
  0 siblings, 0 replies; 32+ messages in thread
From: Reinette Chatre @ 2024-11-12 19:25 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On 10/29/24 10:28 AM, Tony Luck wrote:
> Computing the bandwidth for an event is cheap, and only done once
> per second. Doing so simplifies switching between events and allows
> choosing different events per ctrl_mon group.
> 

If I understand correctly this changelog only describes the first and the
last hunks of this patch. Further, it is the last hunk that introduces the
duplicate code that prompts the refactoring in next patch "x86/resctrl:
Refactor mbm_update()". Inserting the duplicate code and then refactoring it
out does not seem necessary. What if the second hunk of this patch is extracted
into its own patch and the refactoring squashed with what remains?
In the cover letter your motivation for the separate refactor was "to make it
easier to review", but I wonder if separating the unrelated code in second
hunk would make this separate refactor unnecessary while remaining easy to
review?

Reinette

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-11-02  0:57       ` Fenghua Yu
@ 2024-11-12 22:00         ` Reinette Chatre
  2024-11-12 23:57           ` Luck, Tony
  0 siblings, 1 reply; 32+ messages in thread
From: Reinette Chatre @ 2024-11-12 22:00 UTC (permalink / raw)
  To: Fenghua Yu, Luck, Tony, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

Hi Fenghua and Tony,

On 11/1/24 5:57 PM, Fenghua Yu wrote:
> Hi, Tony,
> 
> On 11/1/24 16:55, Luck, Tony wrote:
>>>> +   if (!strcmp(buf, "mbm_local_bytes")) {
>>>> +           if (is_mbm_local_enabled())
>>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
>>>> +           else
>>>> +                   ret = -ENXIO;
>>>> +   } else if (!strcmp(buf, "mbm_total_bytes")) {
>>>> +           if (is_mbm_total_enabled())
>>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
>>>
>>>
>>> User may think each time toggling the local/total event will effect MBA.
>>> And they may create usage case like frequently changing the events to
>>> maintain/adjust both total and local within bw boundary.
>>>
>>> But toggling mba_mbps_event faster than 1sec doesn't have any effect on
>>> MBA SC because MBA SC is called every one second.
>>>
>>> Maybe need to add a ratelimit of 1 second on calling this function? And
>>> adding info in the document that toggling speed should be slower than 1
>>> second?
>>
>> The limit would need to be per ctrl_mon group, not on calls to this function.
>> It's perfectly ok to switch multiple groups in a short interval.
> 
> Agree.
> 
>>
>> I'm not sure how to rate limit here. I could add a delay so that the write()
>> call blocks until enough time passes before making the change. But
>> what should I do if a user submits more writes to the file? Queue them
>> all and apply at one second intervals?
> 
> Maybe define "mba_mbps_last_time" in rdtgroup. Then
> 
> if (time_before(jiffies, rdtgrp->mba_mbps_last_time + HZ) {
>     rdt_last_cmd_printf("Too fast (>1/s) mba_MBps event change)\n");
>         rdtgroup_kn_unlock(of->kn);
>     return -EAGAIN;
> }
> rdtgrp->mba_mbps_last_time = jiffies;

This seems like enforcing an unnecessary limitation. For example, this would mean
that user space needs to wait for a second to undo a change to the monitoring event
in case there was a mistake. This seems like an unnecessary restriction to me.

I am also afraid that there may be some corner cases where a write to the file and
the actual run of the overflow handler  (on every domain) may not be exactly HZ apart.

Bandwidth allocation remains to be adjusted in either direction with at most the bandwidth
granularity. A user attempting to toggle bandwidth event cannot expecting large
changes in allocated bandwidth even if the events differ significantly.

Surely we can explore more if we learn about a specific use case.

>> Maybe it would be better to just to add some additional text to the
>> documentation pointing out that resctrl only checks bandwidth once
>> per second to make throttling adjustments. So changes to the event
>> will only have effect after some seconds have passed?
> 
> 
> Add additional text would be great.


Agreed.

Reinette

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories
  2024-10-29 17:28 ` [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories Tony Luck
@ 2024-11-12 22:12   ` Reinette Chatre
  2024-11-12 23:42     ` Luck, Tony
  0 siblings, 1 reply; 32+ messages in thread
From: Reinette Chatre @ 2024-11-12 22:12 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On 10/29/24 10:28 AM, Tony Luck wrote:
> When the mba_MBps mount option is used, provide a file in each
> ctrl_mon directory to show which memory monitoring event is
> being used.

Could the changelog be expanded a bit more to inform reader what
the monitoring event is used for?

I would also like to remind about the expectations documented in
"Changelog" section of Documentation/process/maintainer-tip.rst:
"A good structure is to explain the context, the problem and the solution
in separate paragraphs and this order."

> 
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h    |  2 ++
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 25 +++++++++++++++++++++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 17 +++++++++++++++
>  3 files changed, 44 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a6f051fb2e69..5f3438ca9e2b 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -609,6 +609,8 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off);
>  int rdtgroup_schemata_show(struct kernfs_open_file *of,
>  			   struct seq_file *s, void *v);
> +int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
> +				 struct seq_file *s, void *v);
>  bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
>  			   unsigned long cbm, int closid, bool exclusive);
>  unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 200d89a64027..b9ba419e5c88 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -518,6 +518,31 @@ static int smp_mon_event_count(void *arg)
>  	return 0;
>  }
>  
> +int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
> +				 struct seq_file *s, void *v)
> +{
> +	struct rdtgroup *rdtgrp;
> +
> +	rdtgrp = rdtgroup_kn_lock_live(of->kn);
> +
> +	if (rdtgrp) {
> +		switch (rdtgrp->mba_mbps_event) {
> +		case QOS_L3_MBM_LOCAL_EVENT_ID:
> +			seq_puts(s, "mbm_local_bytes\n");
> +			break;
> +		case QOS_L3_MBM_TOTAL_EVENT_ID:
> +			seq_puts(s, "mbm_total_bytes\n");
> +			break;
> +		case QOS_L3_OCCUP_EVENT_ID:
> +			break;

Having a value of QOS_L3_OCCUP_EVENT_ID would surely be a kernel bug.
What do you think of a WARN_ON_ONCE()/pr_warn_once() here?

If mba_mbps_event is indeed expected to have a value of "0" to
reflect "uninitialized" then it could also be handled here
to catch any kernel bugs.

> +		}

The custom is to return -ENOENT if no rdtgrp.

> +	}
> +
> +	rdtgroup_kn_unlock(of->kn);
> +
> +	return 0;
> +}
> +
>  void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
>  		    cpumask_t *cpumask, int evtid, int first)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5034a3dd0430..3ba81963e981 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1943,6 +1943,12 @@ static struct rftype res_common_files[] = {
>  		.seq_show	= rdtgroup_schemata_show,
>  		.fflags		= RFTYPE_CTRL_BASE,
>  	},
> +	{
> +		.name		= "mba_MBps_event",
> +		.mode		= 0644,

Please only support writing to file when appropriate callback exists.

> +		.kf_ops		= &rdtgroup_kf_single_ops,
> +		.seq_show	= rdtgroup_mba_mbps_event_show,
> +	},
>  	{
>  		.name		= "mode",
>  		.mode		= 0644,
> @@ -2042,6 +2048,15 @@ void __init mbm_config_rftype_init(const char *config)
>  		rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE;
>  }
>  
> +static void mba_mbps_event_init(bool enable)

fyi ...
https://lore.kernel.org/all/237409fb566288d9f3dc7568385e6488b62dbba0.1730244116.git.babu.moger@amd.com/

> +{
> +	struct rftype *rft;
> +
> +	rft = rdtgroup_get_rftype_by_name("mba_MBps_event");
> +	if (rft)
> +		rft->fflags = enable ? RFTYPE_CTRL_BASE : 0;

I think this sets this file to be created for all CTRL groups, even when not supporting
monitoring?

> +}
> +
>  /**
>   * rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file
>   * @r: The resource group with which the file is associated.
> @@ -2371,6 +2386,8 @@ static int set_mba_sc(bool mba_sc)
>  			d->mbps_val[i] = MBA_MAX_MBPS;
>  	}
>  
> +	mba_mbps_event_init(mba_sc);
> +
>  	return 0;
>  }
>  

Reinette

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-10-29 17:28 ` [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file Tony Luck
  2024-11-01 23:26   ` Fenghua Yu
@ 2024-11-12 22:18   ` Reinette Chatre
  1 sibling, 0 replies; 32+ messages in thread
From: Reinette Chatre @ 2024-11-12 22:18 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On 10/29/24 10:28 AM, Tony Luck wrote:
> A user can choose any of the memory bandwidth monitoring events
> listed in /sys/fs/resctrl/info/L3_mon/mon_features independently
> for each ctrl_mon group by writing to the "mba_MBps_event" file.

Please review the changelog based on tip requirements. Folks used to
tip customs may expect changelog to start with the context, while the
above reads like it describes current context it describes what this patch
makes possible without ever getting to description of the change itself.

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h    |  2 +
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 46 +++++++++++++++++++++++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  1 +
>  3 files changed, 49 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 5f3438ca9e2b..35483c6615b6 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -609,6 +609,8 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off);
>  int rdtgroup_schemata_show(struct kernfs_open_file *of,
>  			   struct seq_file *s, void *v);
> +ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
> +				      char *buf, size_t nbytes, loff_t off);
>  int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
>  				 struct seq_file *s, void *v);
>  bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index b9ba419e5c88..fc5585dc688f 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -518,6 +518,52 @@ static int smp_mon_event_count(void *arg)
>  	return 0;
>  }
>  
> +ssize_t rdtgroup_mba_mbps_event_write(struct kernfs_open_file *of,
> +				      char *buf, size_t nbytes, loff_t off)
> +{
> +	struct rdtgroup *rdtgrp;
> +	int ret = 0;
> +
> +	/* Valid input requires a trailing newline */
> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
> +		return -EINVAL;
> +	buf[nbytes - 1] = '\0';
> +
> +	rdtgrp = rdtgroup_kn_lock_live(of->kn);
> +	if (!rdtgrp) {
> +		rdtgroup_kn_unlock(of->kn);
> +		return -ENOENT;
> +	}
> +	rdt_last_cmd_clear();
> +
> +	if (!strcmp(buf, "mbm_local_bytes")) {
> +		if (is_mbm_local_enabled())
> +			rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> +		else
> +			ret = -ENXIO;
> +	} else if (!strcmp(buf, "mbm_total_bytes")) {
> +		if (is_mbm_total_enabled())
> +			rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> +		else
> +			ret = -ENXIO;
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	switch (ret) {
> +	case -ENXIO:
> +		rdt_last_cmd_printf("Unsupported event id '%s'\n", buf);
> +		break;
> +	case -EINVAL:
> +		rdt_last_cmd_printf("Unknown event id '%s'\n", buf);
> +		break;
> +	}

nit: What is the benefit of distinguishing between these cases in messaging to
user space? I am thinking of the scenario where user may write "llc_occupancy",
this will print "Unknown event id", which is technically not correct since it is
"known", it is just not "supported". Perhaps the default error message can just
always be "Unsupported"?

> +
> +	rdtgroup_kn_unlock(of->kn);
> +
> +	return ret ?: nbytes;
> +}
> +
>  int rdtgroup_mba_mbps_event_show(struct kernfs_open_file *of,
>  				 struct seq_file *s, void *v)
>  {
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 3ba81963e981..6fa501ef187f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1947,6 +1947,7 @@ static struct rftype res_common_files[] = {
>  		.name		= "mba_MBps_event",
>  		.mode		= 0644,

Writable bits can be set in this commit.

>  		.kf_ops		= &rdtgroup_kf_single_ops,
> +		.write		= rdtgroup_mba_mbps_event_write,
>  		.seq_show	= rdtgroup_mba_mbps_event_show,
>  	},
>  	{


Reinette

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 7/7] x86/resctrl: Document the new "mba_MBps_event" file
  2024-10-29 17:28 ` [PATCH v8 7/7] x86/resctrl: Document the new " Tony Luck
@ 2024-11-12 22:25   ` Reinette Chatre
  0 siblings, 0 replies; 32+ messages in thread
From: Reinette Chatre @ 2024-11-12 22:25 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On 10/29/24 10:28 AM, Tony Luck wrote:
> New read/write file to show/set the memory bandwidth event used
> to control bandwidth used by each ctrl_mon group.

I expected at least a verb to summarize what this patch does.

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  Documentation/arch/x86/resctrl.rst | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a824affd741d..d86081e76bbf 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -384,6 +384,16 @@ When monitoring is enabled all MON groups will also contain:
>  	Available only with debug option. The identifier used by hardware
>  	for the monitor group. On x86 this is the RMID.
>  
> +When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
> +
> +"mba_MBps_event":
> +	Reading this file shows which memory bandwidth event is used
> +	as input to a software feedback loop to keep memory bandwidth

Since there is only one "software feedback loop" it could be specific:

	" ... input to the software feedback loop that keeps memory bandwidth ..."

> +	below the value specified in the schemata file. Writing the
> +	name of one of the supported memory bandwidth events found in
> +	/sys/fs/resctrl/info/L3_MON/monfeatures will change the input

"monfeatures" -> "mon_features"
"will change the input event" -> "changes the input event"

(changes to document expectations wrt timing to be included)

> +	event.
> +
>  Resource allocation rules
>  -------------------------
>  

Reinette

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories
  2024-11-12 22:12   ` Reinette Chatre
@ 2024-11-12 23:42     ` Luck, Tony
  2024-11-13  0:20       ` Reinette Chatre
  0 siblings, 1 reply; 32+ messages in thread
From: Luck, Tony @ 2024-11-12 23:42 UTC (permalink / raw)
  To: Chatre, Reinette, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

> > +static int set_mba_sc(bool mba_sc)
> > +{
> > +   struct rftype *rft;
> > +
> > +   rft = rdtgroup_get_rftype_by_name("mba_MBps_event");
> > +   if (rft)
> > +           rft->fflags = enable ? RFTYPE_CTRL_BASE : 0;
>
> I think this sets this file to be created for all CTRL groups, even when not supporting
> monitoring?

No. The calling sequence is:

rdt_get_tree()
    rdt_enable_ctx()

        if (ctx->enable_mba_mbps) {
                ret = set_mba_sc(true);
                if (ret)
                        goto out_cdpl3;
        }

So set_mba_sc() is only called when the mba_MBps mount option has been used. So
mba_mbps_event_init() isn't called.

Note that on unmount of the resctrl file system rdt_kill_sb() calls rdt_disable_ctx()
which calls set_mba_sc(false) which will clear rft->fflags on systems which support
mba_MBps.

-Tony


^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-11-12 22:00         ` Reinette Chatre
@ 2024-11-12 23:57           ` Luck, Tony
  2024-11-13  0:40             ` Reinette Chatre
  0 siblings, 1 reply; 32+ messages in thread
From: Luck, Tony @ 2024-11-12 23:57 UTC (permalink / raw)
  To: Chatre, Reinette, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

> >>>> +   if (!strcmp(buf, "mbm_local_bytes")) {
> >>>> +           if (is_mbm_local_enabled())
> >>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> >>>> +           else
> >>>> +                   ret = -ENXIO;
> >>>> +   } else if (!strcmp(buf, "mbm_total_bytes")) {
> >>>> +           if (is_mbm_total_enabled())
> >>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> >>>
> >>>
> >>> User may think each time toggling the local/total event will effect MBA.
> >>> And they may create usage case like frequently changing the events to
> >>> maintain/adjust both total and local within bw boundary.
> >>>
> >>> But toggling mba_mbps_event faster than 1sec doesn't have any effect on
> >>> MBA SC because MBA SC is called every one second.
> >>>
> >>> Maybe need to add a ratelimit of 1 second on calling this function? And
> >>> adding info in the document that toggling speed should be slower than 1
> >>> second?
> >>
> >> The limit would need to be per ctrl_mon group, not on calls to this function.
> >> It's perfectly ok to switch multiple groups in a short interval.
> >
> > Agree.
> >
> >>
> >> I'm not sure how to rate limit here. I could add a delay so that the write()
> >> call blocks until enough time passes before making the change. But
> >> what should I do if a user submits more writes to the file? Queue them
> >> all and apply at one second intervals?
> >
> > Maybe define "mba_mbps_last_time" in rdtgroup. Then
> >
> > if (time_before(jiffies, rdtgrp->mba_mbps_last_time + HZ) {
> >     rdt_last_cmd_printf("Too fast (>1/s) mba_MBps event change)\n");
> >         rdtgroup_kn_unlock(of->kn);
> >     return -EAGAIN;
> > }
> > rdtgrp->mba_mbps_last_time = jiffies;
>
> This seems like enforcing an unnecessary limitation. For example, this would mean
> that user space needs to wait for a second to undo a change to the monitoring event
> in case there was a mistake. This seems like an unnecessary restriction to me.
>
> I am also afraid that there may be some corner cases where a write to the file and
> the actual run of the overflow handler  (on every domain) may not be exactly HZ apart.
>
> Bandwidth allocation remains to be adjusted in either direction with at most the bandwidth
> granularity. A user attempting to toggle bandwidth event cannot expecting large
> changes in allocated bandwidth even if the events differ significantly.
>
> Surely we can explore more if we learn about a specific use case.

Note that the kernel generally doesn't prevent the user from doing inane things
that do not cause damage.  E.g. a user can already abuse the legacy memory
percentage bandwidth controls with

while :
do
	echo "MB:0=100;1=10" > schemata
	sleep 0.1
	echo "MB:0=10;1=100" > schemata
	sleep 0.1
done

Similar abuse of the percentage mode is also possible.

>
> >> Maybe it would be better to just to add some additional text to the
> >> documentation pointing out that resctrl only checks bandwidth once
> >> per second to make throttling adjustments. So changes to the event
> >> will only have effect after some seconds have passed?
> >
> >
> > Add additional text would be great.
>
>
> Agreed.

We hadn't previously documented the rate at which mba_sc measured and adjusted
the memory bandwidth allocation controls. Once per second is currently an implementation
detail that in theory could be changed in the future.

I'm reluctant to carve that value into the stone of resctrl.rst, But without a specific
value "don't change the values too rapidly" is meaningless.

-Tony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories
  2024-11-12 23:42     ` Luck, Tony
@ 2024-11-13  0:20       ` Reinette Chatre
  2024-11-13  0:53         ` Luck, Tony
  0 siblings, 1 reply; 32+ messages in thread
From: Reinette Chatre @ 2024-11-13  0:20 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 11/12/24 3:42 PM, Luck, Tony wrote:
>>> +static int set_mba_sc(bool mba_sc)
>>> +{
>>> +   struct rftype *rft;
>>> +
>>> +   rft = rdtgroup_get_rftype_by_name("mba_MBps_event");
>>> +   if (rft)
>>> +           rft->fflags = enable ? RFTYPE_CTRL_BASE : 0;
>>
>> I think this sets this file to be created for all CTRL groups, even when not supporting
>> monitoring?
> 
> No. The calling sequence is:
> 
> rdt_get_tree()
>     rdt_enable_ctx()
> 
>         if (ctx->enable_mba_mbps) {
>                 ret = set_mba_sc(true);
>                 if (ret)
>                         goto out_cdpl3;
>         }
> 
> So set_mba_sc() is only called when the mba_MBps mount option has been used. So
> mba_mbps_event_init() isn't called.
> 
> Note that on unmount of the resctrl file system rdt_kill_sb() calls rdt_disable_ctx()
> which calls set_mba_sc(false) which will clear rft->fflags on systems which support
> mba_MBps.

It sounds as though you are saying that setting the wrong flags are ok as long as it is
done in some specific calling sequence. Is this correct? Relying on calling sequence
instead of correct flags requires more in-depth knowledge of code flows and makes code
harder to maintain.
Why not just make this "RFTYPE_CTRL_BASE | RFTYPE_MON_BASE" to make it clear that the file
applies to CTRL_MON groups? What am I missing?

Reinette

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file
  2024-11-12 23:57           ` Luck, Tony
@ 2024-11-13  0:40             ` Reinette Chatre
  0 siblings, 0 replies; 32+ messages in thread
From: Reinette Chatre @ 2024-11-13  0:40 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 11/12/24 3:57 PM, Luck, Tony wrote:
>>>>>> +   if (!strcmp(buf, "mbm_local_bytes")) {
>>>>>> +           if (is_mbm_local_enabled())
>>>>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
>>>>>> +           else
>>>>>> +                   ret = -ENXIO;
>>>>>> +   } else if (!strcmp(buf, "mbm_total_bytes")) {
>>>>>> +           if (is_mbm_total_enabled())
>>>>>> +                   rdtgrp->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
>>>>>
>>>>>
>>>>> User may think each time toggling the local/total event will effect MBA.
>>>>> And they may create usage case like frequently changing the events to
>>>>> maintain/adjust both total and local within bw boundary.
>>>>>
>>>>> But toggling mba_mbps_event faster than 1sec doesn't have any effect on
>>>>> MBA SC because MBA SC is called every one second.
>>>>>
>>>>> Maybe need to add a ratelimit of 1 second on calling this function? And
>>>>> adding info in the document that toggling speed should be slower than 1
>>>>> second?
>>>>
>>>> The limit would need to be per ctrl_mon group, not on calls to this function.
>>>> It's perfectly ok to switch multiple groups in a short interval.
>>>
>>> Agree.
>>>
>>>>
>>>> I'm not sure how to rate limit here. I could add a delay so that the write()
>>>> call blocks until enough time passes before making the change. But
>>>> what should I do if a user submits more writes to the file? Queue them
>>>> all and apply at one second intervals?
>>>
>>> Maybe define "mba_mbps_last_time" in rdtgroup. Then
>>>
>>> if (time_before(jiffies, rdtgrp->mba_mbps_last_time + HZ) {
>>>     rdt_last_cmd_printf("Too fast (>1/s) mba_MBps event change)\n");
>>>         rdtgroup_kn_unlock(of->kn);
>>>     return -EAGAIN;
>>> }
>>> rdtgrp->mba_mbps_last_time = jiffies;
>>
>> This seems like enforcing an unnecessary limitation. For example, this would mean
>> that user space needs to wait for a second to undo a change to the monitoring event
>> in case there was a mistake. This seems like an unnecessary restriction to me.
>>
>> I am also afraid that there may be some corner cases where a write to the file and
>> the actual run of the overflow handler  (on every domain) may not be exactly HZ apart.
>>
>> Bandwidth allocation remains to be adjusted in either direction with at most the bandwidth
>> granularity. A user attempting to toggle bandwidth event cannot expecting large
>> changes in allocated bandwidth even if the events differ significantly.
>>
>> Surely we can explore more if we learn about a specific use case.
> 
> Note that the kernel generally doesn't prevent the user from doing inane things
> that do not cause damage.  E.g. a user can already abuse the legacy memory
> percentage bandwidth controls with
> 
> while :
> do
> 	echo "MB:0=100;1=10" > schemata
> 	sleep 0.1
> 	echo "MB:0=10;1=100" > schemata
> 	sleep 0.1
> done
> 
> Similar abuse of the percentage mode is also possible.

Good point.

At least with the percentage modes the MBA allocation is promptly written to
hardware.

Related to issue at hand the user is also already able to quickly switch the MBps
value via quick successive writes to the schemata and those changes are already
invisible to the software controller that only consumes user's MBps once per second.

>>>> Maybe it would be better to just to add some additional text to the
>>>> documentation pointing out that resctrl only checks bandwidth once
>>>> per second to make throttling adjustments. So changes to the event
>>>> will only have effect after some seconds have passed?
>>>
>>>
>>> Add additional text would be great.
>>
>>
>> Agreed.
> 
> We hadn't previously documented the rate at which mba_sc measured and adjusted
> the memory bandwidth allocation controls. Once per second is currently an implementation
> detail that in theory could be changed in the future.
> 
> I'm reluctant to carve that value into the stone of resctrl.rst, But without a specific
> value "don't change the values too rapidly" is meaningless.

I agree that one second should not be part of documentation, a higher level generic "interval"
can be used instead. But ... since you pointed out that writing to schemata also has this
caveat any documentation should ideally address that also. With existing documentation seemingly
sufficient in that regard I'll also be ok with keeping it as high level with this new event
support.

Reinette


^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories
  2024-11-13  0:20       ` Reinette Chatre
@ 2024-11-13  0:53         ` Luck, Tony
  2024-11-13  2:54           ` Reinette Chatre
  0 siblings, 1 reply; 32+ messages in thread
From: Luck, Tony @ 2024-11-13  0:53 UTC (permalink / raw)
  To: Chatre, Reinette, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

> >>> +static int set_mba_sc(bool mba_sc)
> >>> +{
> >>> +   struct rftype *rft;
> >>> +
> >>> +   rft = rdtgroup_get_rftype_by_name("mba_MBps_event");
> >>> +   if (rft)
> >>> +           rft->fflags = enable ? RFTYPE_CTRL_BASE : 0;
> >>
> >> I think this sets this file to be created for all CTRL groups, even when not supporting
> >> monitoring?

I think I misunderstood you. I though you said the these mba_MBps_event files
would be created even if monitoring is not supported,

But now I wonder what you mean by "all CTRL groups".

> > No. The calling sequence is:
> >
> > rdt_get_tree()
> >     rdt_enable_ctx()
> >
> >         if (ctx->enable_mba_mbps) {
> >                 ret = set_mba_sc(true);
> >                 if (ret)
> >                         goto out_cdpl3;
> >         }
> >
> > So set_mba_sc() is only called when the mba_MBps mount option has been used. So
> > mba_mbps_event_init() isn't called.
> >
> > Note that on unmount of the resctrl file system rdt_kill_sb() calls rdt_disable_ctx()
> > which calls set_mba_sc(false) which will clear rft->fflags on systems which support
> > mba_MBps.
>
> It sounds as though you are saying that setting the wrong flags are ok as long as it is
> done in some specific calling sequence. Is this correct? Relying on calling sequence
> instead of correct flags requires more in-depth knowledge of code flows and makes code
> harder to maintain.
> Why not just make this "RFTYPE_CTRL_BASE | RFTYPE_MON_BASE" to make it clear that the file
> applies to CTRL_MON groups? What am I missing?

The directories which need this new file are the same ones that get a "schemata" file.

The initialization of fflags for the schemata file just uses RFTYPE_CTRL_BASE:

        {
                .name           = "schemata",
                .mode           = 0644,
                .kf_ops         = &rdtgroup_kf_single_ops,
                .write          = rdtgroup_schemata_write,
                .seq_show       = rdtgroup_schemata_show,
                .fflags         = RFTYPE_CTRL_BASE,
        },

I don't see any files using .fflags = "RFTYPE_CTRL_BASE | RFTYPE_MON_BASE"

-Tony


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories
  2024-11-13  0:53         ` Luck, Tony
@ 2024-11-13  2:54           ` Reinette Chatre
  0 siblings, 0 replies; 32+ messages in thread
From: Reinette Chatre @ 2024-11-13  2:54 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Peter Newman, Jonathan Corbet,
	Shuah Khan, x86@kernel.org
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 11/12/24 4:53 PM, Luck, Tony wrote:
>>>>> +static int set_mba_sc(bool mba_sc)
>>>>> +{
>>>>> +   struct rftype *rft;
>>>>> +
>>>>> +   rft = rdtgroup_get_rftype_by_name("mba_MBps_event");
>>>>> +   if (rft)
>>>>> +           rft->fflags = enable ? RFTYPE_CTRL_BASE : 0;
>>>>
>>>> I think this sets this file to be created for all CTRL groups, even when not supporting
>>>> monitoring?
> 
> I think I misunderstood you. I though you said the these mba_MBps_event files
> would be created even if monitoring is not supported,
> 
> But now I wonder what you mean by "all CTRL groups".
> 
>>> No. The calling sequence is:
>>>
>>> rdt_get_tree()
>>>     rdt_enable_ctx()
>>>
>>>         if (ctx->enable_mba_mbps) {
>>>                 ret = set_mba_sc(true);
>>>                 if (ret)
>>>                         goto out_cdpl3;
>>>         }
>>>
>>> So set_mba_sc() is only called when the mba_MBps mount option has been used. So
>>> mba_mbps_event_init() isn't called.
>>>
>>> Note that on unmount of the resctrl file system rdt_kill_sb() calls rdt_disable_ctx()
>>> which calls set_mba_sc(false) which will clear rft->fflags on systems which support
>>> mba_MBps.
>>
>> It sounds as though you are saying that setting the wrong flags are ok as long as it is
>> done in some specific calling sequence. Is this correct? Relying on calling sequence
>> instead of correct flags requires more in-depth knowledge of code flows and makes code
>> harder to maintain.
>> Why not just make this "RFTYPE_CTRL_BASE | RFTYPE_MON_BASE" to make it clear that the file
>> applies to CTRL_MON groups? What am I missing?
> 
> The directories which need this new file are the same ones that get a "schemata" file.

Only support for control is required for "schemata" to be created. Monitoring support is not
required for "schemata" to be created. This new file requires both monitoring and control support. 

> 
> The initialization of fflags for the schemata file just uses RFTYPE_CTRL_BASE:
> 
>         {
>                 .name           = "schemata",
>                 .mode           = 0644,
>                 .kf_ops         = &rdtgroup_kf_single_ops,
>                 .write          = rdtgroup_schemata_write,
>                 .seq_show       = rdtgroup_schemata_show,
>                 .fflags         = RFTYPE_CTRL_BASE,
>         },
> 
> I don't see any files using .fflags = "RFTYPE_CTRL_BASE | RFTYPE_MON_BASE"
> 

I do not think there is a precedent for this case where a file requires both control and
monitoring support.

Reinette 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-10-29 17:28 ` [PATCH v8 3/7] x86/resctrl: Refactor mbm_update() Tony Luck
  2024-11-01 22:08   ` Fenghua Yu
@ 2024-11-13 22:25   ` Reinette Chatre
  2024-11-13 22:58     ` Tony Luck
  1 sibling, 1 reply; 32+ messages in thread
From: Reinette Chatre @ 2024-11-13 22:25 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan,
	x86
  Cc: James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On 10/29/24 10:28 AM, Tony Luck wrote:
> Computing memory bandwidth for all enabled events resulted in
> identical code blocks for total and local bandwidth in mbm_update().
> 
> Refactor with a helper function to eliminate code duplication.
> 
> No functional change.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/monitor.c | 69 ++++++++++-----------------
>  1 file changed, 24 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3ef339e405c2..1b6cb3bbc008 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -829,62 +829,41 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
>  	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
>  }
>  
> -static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
> -		       u32 closid, u32 rmid)
> +static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
> +				 u32 closid, u32 rmid, enum resctrl_event_id evtid)
>  {
>  	struct rmid_read rr = {0};
>  
>  	rr.r = r;
>  	rr.d = d;
> +	rr.evtid = evtid;
> +	rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
> +	if (IS_ERR(rr.arch_mon_ctx)) {
> +		pr_warn_ratelimited("Failed to allocate monitor context: %ld",
> +				    PTR_ERR(rr.arch_mon_ctx));
> +		return;
> +	}
> +
> +	__mon_event_count(closid, rmid, &rr);
> +
> +	if (is_mba_sc(NULL))
> +		mbm_bw_count(closid, rmid, &rr);
> +

As I am staring at this more there seems to be an existing issue here ... note how
__mon_event_count()'s return value is not checked before mbm_bw_count() is called.
This means that mbm_bw_count() may run with rr.val of 0 that results in wraparound
inside it resulting in some unexpected bandwidth numbers. Since a counter read can fail
with a "Unavailable"/"Error" from hardware it is not deterministic how frequently this
issue can be encountered.

Skipping mbm_bw_count() if rr.val is 0 is one option ... that would keep the bandwidth
measurement static at whatever was the last successful read and thus not cause dramatic
changes by the software controller ... setting bandwidth to 0 if rr.val is 0 is another
option to reflect that bandwidth data is unavailable, but then the software controller should
perhaps get signal to not make adjustments? I expect there are better options? What do
you think?

Reinette


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-11-13 22:25   ` Reinette Chatre
@ 2024-11-13 22:58     ` Tony Luck
  2024-11-13 23:58       ` Reinette Chatre
  2024-11-14 10:31       ` Peter Newman
  0 siblings, 2 replies; 32+ messages in thread
From: Tony Luck @ 2024-11-13 22:58 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

On Wed, Nov 13, 2024 at 02:25:53PM -0800, Reinette Chatre wrote:
> Hi Tony,
> 
> On 10/29/24 10:28 AM, Tony Luck wrote:
> > Computing memory bandwidth for all enabled events resulted in
> > identical code blocks for total and local bandwidth in mbm_update().
> > 
> > Refactor with a helper function to eliminate code duplication.
> > 
> > No functional change.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/kernel/cpu/resctrl/monitor.c | 69 ++++++++++-----------------
> >  1 file changed, 24 insertions(+), 45 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 3ef339e405c2..1b6cb3bbc008 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -829,62 +829,41 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
> >  	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
> >  }
> >  
> > -static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
> > -		       u32 closid, u32 rmid)
> > +static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
> > +				 u32 closid, u32 rmid, enum resctrl_event_id evtid)
> >  {
> >  	struct rmid_read rr = {0};
> >  
> >  	rr.r = r;
> >  	rr.d = d;
> > +	rr.evtid = evtid;
> > +	rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
> > +	if (IS_ERR(rr.arch_mon_ctx)) {
> > +		pr_warn_ratelimited("Failed to allocate monitor context: %ld",
> > +				    PTR_ERR(rr.arch_mon_ctx));
> > +		return;
> > +	}
> > +
> > +	__mon_event_count(closid, rmid, &rr);
> > +
> > +	if (is_mba_sc(NULL))
> > +		mbm_bw_count(closid, rmid, &rr);
> > +
> 
> As I am staring at this more there seems to be an existing issue here ... note how
> __mon_event_count()'s return value is not checked before mbm_bw_count() is called.
> This means that mbm_bw_count() may run with rr.val of 0 that results in wraparound
> inside it resulting in some unexpected bandwidth numbers. Since a counter read can fail
> with a "Unavailable"/"Error" from hardware it is not deterministic how frequently this
> issue can be encountered.
> 
> Skipping mbm_bw_count() if rr.val is 0 is one option ... that would keep the bandwidth
> measurement static at whatever was the last successful read and thus not cause dramatic
> changes by the software controller ... setting bandwidth to 0 if rr.val is 0 is another
> option to reflect that bandwidth data is unavailable, but then the software controller should
> perhaps get signal to not make adjustments? I expect there are better options? What do
> you think?

Skipping mbm_bw_count() is also undesirable. If some later
__mon_event_count() does succeed the bandwidth will be computed
based on the last and current values as if they were one second
apart, when actually some longer interval elapsed.

I don't think this is a big issue for current Intel CPU RDT
implementations because I don't think they will return the
bit 62 unavailable value in the IA32_QM_CTR MSR. I'll ask
around to check.

But it does mean that implementing the "summary bandwidth"
file discussed in the other e-mail thread[1] may be more
complex on systems that can return that a counter is
unavailable. We'd have to keep track that two succesful
counter reads occured, with a measure of the interval
between them before reporting a value in the summary file.

-Tony

[1] https://lore.kernel.org/all/CALPaoCjCWZ4ZYfwooFEzMn15jJM7s9Rfq83YhorOGUD=1GdSyw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-11-13 22:58     ` Tony Luck
@ 2024-11-13 23:58       ` Reinette Chatre
  2024-11-14 10:31       ` Peter Newman
  1 sibling, 0 replies; 32+ messages in thread
From: Reinette Chatre @ 2024-11-13 23:58 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Peter Newman, Jonathan Corbet, Shuah Khan, x86,
	James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On 11/13/24 2:58 PM, Tony Luck wrote:
> On Wed, Nov 13, 2024 at 02:25:53PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 10/29/24 10:28 AM, Tony Luck wrote:
>>> Computing memory bandwidth for all enabled events resulted in
>>> identical code blocks for total and local bandwidth in mbm_update().
>>>
>>> Refactor with a helper function to eliminate code duplication.
>>>
>>> No functional change.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>>  arch/x86/kernel/cpu/resctrl/monitor.c | 69 ++++++++++-----------------
>>>  1 file changed, 24 insertions(+), 45 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index 3ef339e405c2..1b6cb3bbc008 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -829,62 +829,41 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
>>>  	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
>>>  }
>>>  
>>> -static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
>>> -		       u32 closid, u32 rmid)
>>> +static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
>>> +				 u32 closid, u32 rmid, enum resctrl_event_id evtid)
>>>  {
>>>  	struct rmid_read rr = {0};
>>>  
>>>  	rr.r = r;
>>>  	rr.d = d;
>>> +	rr.evtid = evtid;
>>> +	rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
>>> +	if (IS_ERR(rr.arch_mon_ctx)) {
>>> +		pr_warn_ratelimited("Failed to allocate monitor context: %ld",
>>> +				    PTR_ERR(rr.arch_mon_ctx));
>>> +		return;
>>> +	}
>>> +
>>> +	__mon_event_count(closid, rmid, &rr);
>>> +
>>> +	if (is_mba_sc(NULL))
>>> +		mbm_bw_count(closid, rmid, &rr);
>>> +
>>
>> As I am staring at this more there seems to be an existing issue here ... note how
>> __mon_event_count()'s return value is not checked before mbm_bw_count() is called.
>> This means that mbm_bw_count() may run with rr.val of 0 that results in wraparound
>> inside it resulting in some unexpected bandwidth numbers. Since a counter read can fail
>> with a "Unavailable"/"Error" from hardware it is not deterministic how frequently this
>> issue can be encountered.
>>
>> Skipping mbm_bw_count() if rr.val is 0 is one option ... that would keep the bandwidth
>> measurement static at whatever was the last successful read and thus not cause dramatic
>> changes by the software controller ... setting bandwidth to 0 if rr.val is 0 is another
>> option to reflect that bandwidth data is unavailable, but then the software controller should
>> perhaps get signal to not make adjustments? I expect there are better options? What do
>> you think?
> 
> Skipping mbm_bw_count() is also undesirable. If some later
> __mon_event_count() does succeed the bandwidth will be computed
> based on the last and current values as if they were one second
> apart, when actually some longer interval elapsed.

Indeed.

> 
> I don't think this is a big issue for current Intel CPU RDT
> implementations because I don't think they will return the
> bit 62 unavailable value in the IA32_QM_CTR MSR. I'll ask
> around to check.

Thank you very much for confirming this. 

> 
> But it does mean that implementing the "summary bandwidth"
> file discussed in the other e-mail thread[1] may be more
> complex on systems that can return that a counter is
> unavailable. We'd have to keep track that two succesful
> counter reads occured, with a measure of the interval
> between them before reporting a value in the summary file.

Looking at expanding the scope of mbm_bw_count() beyond software
controller as well as beyond Intel to support [1] is indeed why I
am looking at this code more.

Reinette


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-11-13 22:58     ` Tony Luck
  2024-11-13 23:58       ` Reinette Chatre
@ 2024-11-14 10:31       ` Peter Newman
  2024-11-14 17:20         ` Tony Luck
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Newman @ 2024-11-14 10:31 UTC (permalink / raw)
  To: Tony Luck
  Cc: Reinette Chatre, Fenghua Yu, Jonathan Corbet, Shuah Khan, x86,
	James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

Hi Tony,

On Wed, Nov 13, 2024 at 11:58 PM Tony Luck <tony.luck@intel.com> wrote:
>
> On Wed, Nov 13, 2024 at 02:25:53PM -0800, Reinette Chatre wrote:
> > Hi Tony,
> >
> > On 10/29/24 10:28 AM, Tony Luck wrote:
> > > Computing memory bandwidth for all enabled events resulted in
> > > identical code blocks for total and local bandwidth in mbm_update().
> > >
> > > Refactor with a helper function to eliminate code duplication.
> > >
> > > No functional change.
> > >
> > > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > > ---
> > >  arch/x86/kernel/cpu/resctrl/monitor.c | 69 ++++++++++-----------------
> > >  1 file changed, 24 insertions(+), 45 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > > index 3ef339e405c2..1b6cb3bbc008 100644
> > > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > > @@ -829,62 +829,41 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
> > >     resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
> > >  }
> > >
> > > -static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
> > > -                  u32 closid, u32 rmid)
> > > +static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
> > > +                            u32 closid, u32 rmid, enum resctrl_event_id evtid)
> > >  {
> > >     struct rmid_read rr = {0};
> > >
> > >     rr.r = r;
> > >     rr.d = d;
> > > +   rr.evtid = evtid;
> > > +   rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
> > > +   if (IS_ERR(rr.arch_mon_ctx)) {
> > > +           pr_warn_ratelimited("Failed to allocate monitor context: %ld",
> > > +                               PTR_ERR(rr.arch_mon_ctx));
> > > +           return;
> > > +   }
> > > +
> > > +   __mon_event_count(closid, rmid, &rr);
> > > +
> > > +   if (is_mba_sc(NULL))
> > > +           mbm_bw_count(closid, rmid, &rr);
> > > +
> >
> > As I am staring at this more there seems to be an existing issue here ... note how
> > __mon_event_count()'s return value is not checked before mbm_bw_count() is called.
> > This means that mbm_bw_count() may run with rr.val of 0 that results in wraparound
> > inside it resulting in some unexpected bandwidth numbers. Since a counter read can fail
> > with a "Unavailable"/"Error" from hardware it is not deterministic how frequently this
> > issue can be encountered.
> >
> > Skipping mbm_bw_count() if rr.val is 0 is one option ... that would keep the bandwidth
> > measurement static at whatever was the last successful read and thus not cause dramatic
> > changes by the software controller ... setting bandwidth to 0 if rr.val is 0 is another
> > option to reflect that bandwidth data is unavailable, but then the software controller should
> > perhaps get signal to not make adjustments? I expect there are better options? What do
> > you think?
>
> Skipping mbm_bw_count() is also undesirable. If some later
> __mon_event_count() does succeed the bandwidth will be computed
> based on the last and current values as if they were one second
> apart, when actually some longer interval elapsed.
>
> I don't think this is a big issue for current Intel CPU RDT
> implementations because I don't think they will return the
> bit 62 unavailable value in the IA32_QM_CTR MSR. I'll ask
> around to check.
>
> But it does mean that implementing the "summary bandwidth"
> file discussed in the other e-mail thread[1] may be more
> complex on systems that can return that a counter is
> unavailable. We'd have to keep track that two succesful
> counter reads occured, with a measure of the interval
> between them before reporting a value in the summary file.

At least for the purposes of reporting the mbps rate to userspace, my
users will record from the files one per second, so it would be fine
to just report Unavailable (or Unassigned when it's clear that the
error is because counter wasn't unassigned) whenever either the
current or previous reading was not successful. Then they can assume
the value or error reported always refers to the most
recently-completed one-second window.

-Peter

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 3/7] x86/resctrl: Refactor mbm_update()
  2024-11-14 10:31       ` Peter Newman
@ 2024-11-14 17:20         ` Tony Luck
  0 siblings, 0 replies; 32+ messages in thread
From: Tony Luck @ 2024-11-14 17:20 UTC (permalink / raw)
  To: Peter Newman
  Cc: Reinette Chatre, Fenghua Yu, Jonathan Corbet, Shuah Khan, x86,
	James Morse, Jamie Iles, Babu Moger, Randy Dunlap,
	Shaopeng Tan (Fujitsu), linux-kernel, linux-doc, patches

On Thu, Nov 14, 2024 at 11:31:25AM +0100, Peter Newman wrote:
> At least for the purposes of reporting the mbps rate to userspace, my
> users will record from the files one per second, so it would be fine
> to just report Unavailable (or Unassigned when it's clear that the
> error is because counter wasn't unassigned) whenever either the
> current or previous reading was not successful. Then they can assume
> the value or error reported always refers to the most
> recently-completed one-second window.

First hack at keeping status for memory bandwidth values. Simple
state machine.  Compile tested.


diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 6345ab3e0890..80de5c6b0754 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -359,14 +359,22 @@ struct rftype {
 			 char *buf, size_t nbytes, loff_t off);
 };
 
+enum mbm_state_status {
+	MBM_INVALID,
+	MBM_ONE_VALUE,
+	MBM_VALID
+};
+
 /**
  * struct mbm_state - status for each MBM counter in each domain
  * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
  * @prev_bw:	The most recent bandwidth in MBps
+ * @status:	Validity of @prev_bw
  */
 struct mbm_state {
-	u64	prev_bw_bytes;
-	u32	prev_bw;
+	u64			prev_bw_bytes;
+	u32			prev_bw;
+	enum mbm_state_status	status;
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index da4ae21350c8..767a526af2f5 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -670,6 +670,17 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
 	if (WARN_ON_ONCE(!m))
 		return;
 
+	if (rr->err || !rr->val) {
+		m->status = MBM_INVALID;
+		return;
+	}
+
+	if (m->status == MBM_INVALID) {
+		m->status = MBM_ONE_VALUE;
+		m->prev_bw_bytes = rr->val;
+		return;
+	}
+
 	cur_bytes = rr->val;
 	bytes = cur_bytes - m->prev_bw_bytes;
 	m->prev_bw_bytes = cur_bytes;
@@ -677,6 +688,7 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
 	cur_bw = bytes / SZ_1M;
 
 	m->prev_bw = cur_bw;
+	m->status = MBM_VALID;
 }
 
 /*
@@ -781,6 +793,9 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 	if (WARN_ON_ONCE(!pmbm_data))
 		return;
 
+	if (pmbm_data->status != MBM_VALID)
+		return;
+
 	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
@@ -801,6 +816,9 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 		cmbm_data = get_mbm_state(dom_mbm, entry->closid, entry->mon.rmid, evt_id);
 		if (WARN_ON_ONCE(!cmbm_data))
 			return;
+		if (cmbm_data->status != MBM_VALID)
+			return;
+
 		cur_bw += cmbm_data->prev_bw;
 	}
 

^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-11-14 17:20 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-29 17:28 [PATCH v8 0/7] x86/resctrl: mba_MBps enhancement Tony Luck
2024-10-29 17:28 ` [PATCH v8 1/7] x86/resctrl: Prepare for per-ctrl_mon group mba_MBps control Tony Luck
2024-11-01 22:03   ` Fenghua Yu
2024-11-01 22:40     ` Tony Luck
2024-11-12 19:24   ` Reinette Chatre
2024-10-29 17:28 ` [PATCH v8 2/7] x86/resctrl: Compute memory bandwidth for all supported events Tony Luck
2024-11-12 19:25   ` Reinette Chatre
2024-10-29 17:28 ` [PATCH v8 3/7] x86/resctrl: Refactor mbm_update() Tony Luck
2024-11-01 22:08   ` Fenghua Yu
2024-11-01 22:57     ` Tony Luck
2024-11-13 22:25   ` Reinette Chatre
2024-11-13 22:58     ` Tony Luck
2024-11-13 23:58       ` Reinette Chatre
2024-11-14 10:31       ` Peter Newman
2024-11-14 17:20         ` Tony Luck
2024-10-29 17:28 ` [PATCH v8 4/7] x86/resctrl: Relax checks for mba_MBps mount option Tony Luck
2024-10-29 17:28 ` [PATCH v8 5/7] x86/resctrl: Add "mba_MBps_event" file to ctrl_mon directories Tony Luck
2024-11-12 22:12   ` Reinette Chatre
2024-11-12 23:42     ` Luck, Tony
2024-11-13  0:20       ` Reinette Chatre
2024-11-13  0:53         ` Luck, Tony
2024-11-13  2:54           ` Reinette Chatre
2024-10-29 17:28 ` [PATCH v8 6/7] x86/resctrl: Add write option to "mba_MBps_event" file Tony Luck
2024-11-01 23:26   ` Fenghua Yu
2024-11-01 23:55     ` Luck, Tony
2024-11-02  0:57       ` Fenghua Yu
2024-11-12 22:00         ` Reinette Chatre
2024-11-12 23:57           ` Luck, Tony
2024-11-13  0:40             ` Reinette Chatre
2024-11-12 22:18   ` Reinette Chatre
2024-10-29 17:28 ` [PATCH v8 7/7] x86/resctrl: Document the new " Tony Luck
2024-11-12 22:25   ` Reinette Chatre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox