* [PATCH 0/4] fs/resctrl: Fix three long-standing issues
@ 2026-05-08 18:21 Tony Luck
2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck
Sashiko reported[0] a deadlock during mount, and a use-after-free when an
L3 domain is removed during CPU offline. Reinette found a memory leak
in the mount error path while refactoring code for a solution to the
mount hang.
Patch 1 just reorders some code so that the fixes can be applied without
adding additional forward declarations of functions.
Patch 2 fixes the memory leak found by Reinette.
Patch 3 fixes the mount deadlock.[1]
Patch 4 fixes the issues with CPU offline.[2]
N.B. Reinette did all the work for patches 3 & 4, so I listed her
as author and me as Co-developed-by. Those patches need her sign-off.
[0] Sashiko report:
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com
[1] This is v3 of the deadlock fix. Previous version is here:
Link: https://lore.kernel.org/all/20260504220149.157753-1-tony.luck@intel.com/
[2] This is v2 of the CPU offline fix. Previous version is here:
Link: https://lore.kernel.org/all/20260501213611.25600-1-tony.luck@intel.com/
Reinette Chatre (2):
fs/resctrl: Fix deadlock for errors during mount
fs/resctrl: Fix issues with worker threads when CPUs are taken offline
Tony Luck (2):
fs/resctrl: Move functions to avoid forward references in subsequent
fixes
fs/resctrl: Free mon_data structures on rdt_get_tree() failure
fs/resctrl/monitor.c | 55 ++++++
fs/resctrl/rdtgroup.c | 440 ++++++++++++++++++++++--------------------
2 files changed, 287 insertions(+), 208 deletions(-)
--
2.54.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes
2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
` (2 subsequent siblings)
3 siblings, 0 replies; 14+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck
No functional change. Just pull some functions before rdt_get_tree().
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/rdtgroup.c | 376 +++++++++++++++++++++---------------------
1 file changed, 188 insertions(+), 188 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 5dfdaa6f9d8f..a6376a3fc4c3 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2782,6 +2782,194 @@ static void schemata_list_destroy(void)
}
}
+/*
+ * Move tasks from one to the other group. If @from is NULL, then all tasks
+ * in the systems are moved unconditionally (used for teardown).
+ *
+ * If @mask is not NULL the cpus on which moved tasks are running are set
+ * in that mask so the update smp function call is restricted to affected
+ * cpus.
+ */
+static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
+ struct cpumask *mask)
+{
+ struct task_struct *p, *t;
+
+ read_lock(&tasklist_lock);
+ for_each_process_thread(p, t) {
+ if (!from || is_closid_match(t, from) ||
+ is_rmid_match(t, from)) {
+ resctrl_arch_set_closid_rmid(t, to->closid,
+ to->mon.rmid);
+
+ /*
+ * Order the closid/rmid stores above before the loads
+ * in task_curr(). This pairs with the full barrier
+ * between the rq->curr update and
+ * resctrl_arch_sched_in() during context switch.
+ */
+ smp_mb();
+
+ /*
+ * If the task is on a CPU, set the CPU in the mask.
+ * The detection is inaccurate as tasks might move or
+ * schedule before the smp function call takes place.
+ * In such a case the function call is pointless, but
+ * there is no other side effect.
+ */
+ if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
+ cpumask_set_cpu(task_cpu(t), mask);
+ }
+ }
+ read_unlock(&tasklist_lock);
+}
+
+static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
+{
+ struct rdtgroup *sentry, *stmp;
+ struct list_head *head;
+
+ head = &rdtgrp->mon.crdtgrp_list;
+ list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
+ rdtgroup_unassign_cntrs(sentry);
+ free_rmid(sentry->closid, sentry->mon.rmid);
+ list_del(&sentry->mon.crdtgrp_list);
+
+ if (atomic_read(&sentry->waitcount) != 0)
+ sentry->flags = RDT_DELETED;
+ else
+ rdtgroup_remove(sentry);
+ }
+}
+
+/*
+ * Forcibly remove all of subdirectories under root.
+ */
+static void rmdir_all_sub(void)
+{
+ struct rdtgroup *rdtgrp, *tmp;
+
+ /* Move all tasks to the default resource group */
+ rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
+
+ list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
+ /* Free any child rmids */
+ free_all_child_rdtgrp(rdtgrp);
+
+ /* Remove each rdtgroup other than root */
+ if (rdtgrp == &rdtgroup_default)
+ continue;
+
+ if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
+ rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
+ rdtgroup_pseudo_lock_remove(rdtgrp);
+
+ /*
+ * Give any CPUs back to the default group. We cannot copy
+ * cpu_online_mask because a CPU might have executed the
+ * offline callback already, but is still marked online.
+ */
+ cpumask_or(&rdtgroup_default.cpu_mask,
+ &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
+
+ rdtgroup_unassign_cntrs(rdtgrp);
+
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+
+ kernfs_remove(rdtgrp->kn);
+ list_del(&rdtgrp->rdtgroup_list);
+
+ if (atomic_read(&rdtgrp->waitcount) != 0)
+ rdtgrp->flags = RDT_DELETED;
+ else
+ rdtgroup_remove(rdtgrp);
+ }
+ /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
+ update_closid_rmid(cpu_online_mask, &rdtgroup_default);
+
+ kernfs_remove(kn_info);
+ kernfs_remove(kn_mongrp);
+ kernfs_remove(kn_mondata);
+}
+
+/**
+ * mon_get_kn_priv() - Get the mon_data priv data for this event.
+ *
+ * The same values are used across the mon_data directories of all control and
+ * monitor groups for the same event in the same domain. Keep a list of
+ * allocated structures and re-use an existing one with the same values for
+ * @rid, @domid, etc.
+ *
+ * @rid: The resource id for the event file being created.
+ * @domid: The domain id for the event file being created.
+ * @mevt: The type of event file being created.
+ * @do_sum: Whether SNC summing monitors are being created. Only set
+ * when @rid == RDT_RESOURCE_L3.
+ *
+ * Return: Pointer to mon_data private data of the event, NULL on failure.
+ */
+static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
+ struct mon_evt *mevt,
+ bool do_sum)
+{
+ struct mon_data *priv;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
+ if (priv->rid == rid && priv->domid == domid &&
+ priv->sum == do_sum && priv->evt == mevt)
+ return priv;
+ }
+
+ priv = kzalloc_obj(*priv);
+ if (!priv)
+ return NULL;
+
+ priv->rid = rid;
+ priv->domid = domid;
+ priv->sum = do_sum;
+ priv->evt = mevt;
+ list_add_tail(&priv->list, &mon_data_kn_priv_list);
+
+ return priv;
+}
+
+/**
+ * mon_put_kn_priv() - Free all allocated mon_data structures.
+ *
+ * Called when resctrl file system is unmounted.
+ */
+static void mon_put_kn_priv(void)
+{
+ struct mon_data *priv, *tmp;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
+ list_del(&priv->list);
+ kfree(priv);
+ }
+}
+
+static void resctrl_fs_teardown(void)
+{
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ /* Cleared by rdtgroup_destroy_root() */
+ if (!rdtgroup_default.kn)
+ return;
+
+ rmdir_all_sub();
+ rdtgroup_unassign_cntrs(&rdtgroup_default);
+ mon_put_kn_priv();
+ rdt_pseudo_lock_release();
+ rdtgroup_default.mode = RDT_MODE_SHAREABLE;
+ closid_exit();
+ schemata_list_destroy();
+ rdtgroup_destroy_root();
+}
+
static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -2981,194 +3169,6 @@ static int rdt_init_fs_context(struct fs_context *fc)
return 0;
}
-/*
- * Move tasks from one to the other group. If @from is NULL, then all tasks
- * in the systems are moved unconditionally (used for teardown).
- *
- * If @mask is not NULL the cpus on which moved tasks are running are set
- * in that mask so the update smp function call is restricted to affected
- * cpus.
- */
-static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
- struct cpumask *mask)
-{
- struct task_struct *p, *t;
-
- read_lock(&tasklist_lock);
- for_each_process_thread(p, t) {
- if (!from || is_closid_match(t, from) ||
- is_rmid_match(t, from)) {
- resctrl_arch_set_closid_rmid(t, to->closid,
- to->mon.rmid);
-
- /*
- * Order the closid/rmid stores above before the loads
- * in task_curr(). This pairs with the full barrier
- * between the rq->curr update and
- * resctrl_arch_sched_in() during context switch.
- */
- smp_mb();
-
- /*
- * If the task is on a CPU, set the CPU in the mask.
- * The detection is inaccurate as tasks might move or
- * schedule before the smp function call takes place.
- * In such a case the function call is pointless, but
- * there is no other side effect.
- */
- if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
- cpumask_set_cpu(task_cpu(t), mask);
- }
- }
- read_unlock(&tasklist_lock);
-}
-
-static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
-{
- struct rdtgroup *sentry, *stmp;
- struct list_head *head;
-
- head = &rdtgrp->mon.crdtgrp_list;
- list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
- rdtgroup_unassign_cntrs(sentry);
- free_rmid(sentry->closid, sentry->mon.rmid);
- list_del(&sentry->mon.crdtgrp_list);
-
- if (atomic_read(&sentry->waitcount) != 0)
- sentry->flags = RDT_DELETED;
- else
- rdtgroup_remove(sentry);
- }
-}
-
-/*
- * Forcibly remove all of subdirectories under root.
- */
-static void rmdir_all_sub(void)
-{
- struct rdtgroup *rdtgrp, *tmp;
-
- /* Move all tasks to the default resource group */
- rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
-
- list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
- /* Free any child rmids */
- free_all_child_rdtgrp(rdtgrp);
-
- /* Remove each rdtgroup other than root */
- if (rdtgrp == &rdtgroup_default)
- continue;
-
- if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
- rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
- rdtgroup_pseudo_lock_remove(rdtgrp);
-
- /*
- * Give any CPUs back to the default group. We cannot copy
- * cpu_online_mask because a CPU might have executed the
- * offline callback already, but is still marked online.
- */
- cpumask_or(&rdtgroup_default.cpu_mask,
- &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
-
- rdtgroup_unassign_cntrs(rdtgrp);
-
- free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-
- kernfs_remove(rdtgrp->kn);
- list_del(&rdtgrp->rdtgroup_list);
-
- if (atomic_read(&rdtgrp->waitcount) != 0)
- rdtgrp->flags = RDT_DELETED;
- else
- rdtgroup_remove(rdtgrp);
- }
- /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
- update_closid_rmid(cpu_online_mask, &rdtgroup_default);
-
- kernfs_remove(kn_info);
- kernfs_remove(kn_mongrp);
- kernfs_remove(kn_mondata);
-}
-
-/**
- * mon_get_kn_priv() - Get the mon_data priv data for this event.
- *
- * The same values are used across the mon_data directories of all control and
- * monitor groups for the same event in the same domain. Keep a list of
- * allocated structures and re-use an existing one with the same values for
- * @rid, @domid, etc.
- *
- * @rid: The resource id for the event file being created.
- * @domid: The domain id for the event file being created.
- * @mevt: The type of event file being created.
- * @do_sum: Whether SNC summing monitors are being created. Only set
- * when @rid == RDT_RESOURCE_L3.
- *
- * Return: Pointer to mon_data private data of the event, NULL on failure.
- */
-static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
- struct mon_evt *mevt,
- bool do_sum)
-{
- struct mon_data *priv;
-
- lockdep_assert_held(&rdtgroup_mutex);
-
- list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
- if (priv->rid == rid && priv->domid == domid &&
- priv->sum == do_sum && priv->evt == mevt)
- return priv;
- }
-
- priv = kzalloc_obj(*priv);
- if (!priv)
- return NULL;
-
- priv->rid = rid;
- priv->domid = domid;
- priv->sum = do_sum;
- priv->evt = mevt;
- list_add_tail(&priv->list, &mon_data_kn_priv_list);
-
- return priv;
-}
-
-/**
- * mon_put_kn_priv() - Free all allocated mon_data structures.
- *
- * Called when resctrl file system is unmounted.
- */
-static void mon_put_kn_priv(void)
-{
- struct mon_data *priv, *tmp;
-
- lockdep_assert_held(&rdtgroup_mutex);
-
- list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
- list_del(&priv->list);
- kfree(priv);
- }
-}
-
-static void resctrl_fs_teardown(void)
-{
- lockdep_assert_held(&rdtgroup_mutex);
-
- /* Cleared by rdtgroup_destroy_root() */
- if (!rdtgroup_default.kn)
- return;
-
- rmdir_all_sub();
- rdtgroup_unassign_cntrs(&rdtgroup_default);
- mon_put_kn_priv();
- rdt_pseudo_lock_release();
- rdtgroup_default.mode = RDT_MODE_SHAREABLE;
- closid_exit();
- schemata_list_destroy();
- rdtgroup_destroy_root();
-}
-
static void rdt_kill_sb(struct super_block *sb)
{
struct rdt_resource *r;
--
2.54.0
* [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
2026-05-08 21:36 ` Luck, Tony
2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
2026-05-08 18:21 ` [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline Tony Luck
3 siblings, 1 reply; 14+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck
If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
fails, the mon_data structures allocated by mon_get_kn_priv() are
leaked. Add mon_put_kn_priv() to the out_mondata error path to free
them.
mon_get_kn_priv() and mon_put_kn_priv() moved so defined before used.
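The placement of the cleanup call matters because resctrl's error labels fall
through: a cleanup runs for every failure that jumps to its label *or to any
earlier label above it*. A minimal userspace sketch of that convention (the
names `mount_tree`, `alloc_res`, etc. are illustrative, not the kernel's):

```c
/*
 * Sketch of goto-unwind with fall-through labels: a cleanup placed under
 * a label runs for every failure at or after the step that label undoes.
 * Put a cleanup under too-late a label and some failure paths leak.
 */
static int held;                      /* resources currently allocated */

static void alloc_res(void) { held++; }
static void free_res(void)  { held--; }

static int mount_tree(int fail_step)
{
	alloc_res();                  /* first setup step */
	if (fail_step == 1)
		goto out_first;
	alloc_res();                  /* second setup step */
	if (fail_step == 2)
		goto out_second;
	return 0;

out_second:
	free_res();                   /* undo the second step */
out_first:
	free_res();                   /* fall-through also undoes the first */
	return -1;
}

/* Returns the number of leaked resources after a failed mount attempt. */
static int demo(int fail_step)
{
	held = 0;
	return mount_tree(fail_step) ? held : -1;
}
```

With the cleanups under the right labels, both failure points unwind to zero
leaked resources; this is the property the fix restores for mon_data.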
Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Assisted-by: GitHub_Copilot_CLI:claude-sonnet-4.6
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/rdtgroup.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a6376a3fc4c3..0db1a92aefbe 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3067,8 +3067,10 @@ static int rdt_get_tree(struct fs_context *fc)
out_psl:
rdt_pseudo_lock_release();
out_mondata:
- if (resctrl_arch_mon_capable())
+ if (resctrl_arch_mon_capable()) {
+ mon_put_kn_priv();
kernfs_remove(kn_mondata);
+ }
out_mongrp:
if (resctrl_arch_mon_capable()) {
rdtgroup_unassign_cntrs(&rdtgroup_default);
--
2.54.0
* [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount
2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
2026-05-10 13:52 ` Chen, Yu C
2026-05-11 22:53 ` Reinette Chatre
2026-05-08 18:21 ` [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline Tony Luck
3 siblings, 2 replies; 14+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck
From: Reinette Chatre <reinette.chatre@intel.com>
Sashiko noticed[1] a deadlock in the resctrl mount code.
rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). If
superblock setup fails inside kernfs_get_tree(), the VFS calls kill_sb on
the same thread before the call returns. rdt_kill_sb() unconditionally
attempts to acquire rdtgroup_mutex and deadlock occurs.
Move the call to kernfs_get_tree() outside of locks.
Add resctrl_unmount() helper to keep code consistent between the
rdt_get_tree() failure path and a normal unmount.
If kernfs_get_tree() fails and ctx->kfc.new_sb_created is set, then rdt_kill_sb()
has already been called and no further cleanup is needed.
Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support")
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
---
fs/resctrl/rdtgroup.c | 63 ++++++++++++++++++++++++-------------------
1 file changed, 36 insertions(+), 27 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 0db1a92aefbe..62e1e4c30f78 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2970,6 +2970,29 @@ static void resctrl_fs_teardown(void)
rdtgroup_destroy_root();
}
+static void resctrl_unmount(void)
+{
+ struct rdt_resource *r;
+
+ cpus_read_lock();
+ mutex_lock(&rdtgroup_mutex);
+
+ rdt_disable_ctx();
+
+ /* Put everything back to default values. */
+ for_each_alloc_capable_rdt_resource(r)
+ resctrl_arch_reset_all_ctrls(r);
+
+ resctrl_fs_teardown();
+ if (resctrl_arch_alloc_capable())
+ resctrl_arch_disable_alloc();
+ if (resctrl_arch_mon_capable())
+ resctrl_arch_disable_mon();
+ resctrl_mounted = false;
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+}
+
static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -3043,10 +3066,6 @@ static int rdt_get_tree(struct fs_context *fc)
if (ret)
goto out_mondata;
- ret = kernfs_get_tree(fc);
- if (ret < 0)
- goto out_psl;
-
if (resctrl_arch_alloc_capable())
resctrl_arch_enable_alloc();
if (resctrl_arch_mon_capable())
@@ -3062,10 +3081,19 @@ static int rdt_get_tree(struct fs_context *fc)
RESCTRL_PICK_ANY_CPU);
}
- goto out;
+ rdt_last_cmd_clear();
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+
+ ret = kernfs_get_tree(fc);
+ /*
+ * resctrl can only be mounted once, new superblock only expected
+ * to be created once.
+ */
+ if (!ctx->kfc.new_sb_created)
+ resctrl_unmount();
+ return ret;
-out_psl:
- rdt_pseudo_lock_release();
out_mondata:
if (resctrl_arch_mon_capable()) {
mon_put_kn_priv();
@@ -3086,7 +3114,6 @@ static int rdt_get_tree(struct fs_context *fc)
out_root:
rdtgroup_destroy_root();
out:
- rdt_last_cmd_clear();
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
return ret;
@@ -3173,26 +3200,8 @@ static int rdt_init_fs_context(struct fs_context *fc)
static void rdt_kill_sb(struct super_block *sb)
{
- struct rdt_resource *r;
-
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
-
- rdt_disable_ctx();
-
- /* Put everything back to default values. */
- for_each_alloc_capable_rdt_resource(r)
- resctrl_arch_reset_all_ctrls(r);
-
- resctrl_fs_teardown();
- if (resctrl_arch_alloc_capable())
- resctrl_arch_disable_alloc();
- if (resctrl_arch_mon_capable())
- resctrl_arch_disable_mon();
- resctrl_mounted = false;
+ resctrl_unmount();
kernfs_kill_sb(sb);
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
}
static struct file_system_type rdt_fs_type = {
--
2.54.0
* [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline
2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
` (2 preceding siblings ...)
2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
2026-05-11 23:06 ` Reinette Chatre
3 siblings, 1 reply; 14+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck
From: Reinette Chatre <reinette.chatre@intel.com>
Sashiko noticed[1] a use-after-free in the resctrl worker thread code
where the rdt_l3_mon_domain structure was freed while the worker was blocked
waiting for locks.
The root issue is that cancel_delayed_work() does not block in the case where
the worker thread is executing. This results in the race that Sashiko noticed,
but also causes problems when the CPU that has been chosen to service the
worker thread is taken offline.
Note that worker threads are allowed to delete their own work_struct
(see comment in kernel/workqueue.c:process_one_work()) so there can't be
any problems on the return path from the worker in the case where the
work_struct was deleted by other code while the worker was executing.
Indicate failure of cancel_delayed_work() calls in resctrl_offline_cpu()
by setting d->mbm_work_cpu or d->cqm_work_cpu to nr_cpu_ids. Make the worker
threads check to see if they are no longer bound to the right CPU. In this
case search the L3 domain list for any domain(s) with the work cpu set to
nr_cpu_ids. In the case where the last CPU was removed from a domain, the
domain has been removed from the list and there is nothing to do. If the
domain still exists, then restart the worker on any of the remaining CPUs.
Remove redundant cancel_delayed_work() calls from resctrl_offline_mon_domain().
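The handoff protocol above can be sketched in a few lines. This is a
single-threaded userspace model under stated assumptions (the sentinel,
`offline_cpu`, `worker_rehome`, and `demo` names are illustrative; the real
code spans resctrl_offline_cpu() and the two worker functions):

```c
#include <stdbool.h>

#define NR_CPUS 8    /* stands in for nr_cpu_ids, the sentinel value */

struct dom {
	int  work_cpu;   /* CPU servicing the worker; NR_CPUS = orphaned */
	bool rearmed;
};

/*
 * Offline path: cancel_delayed_work() returns false when the worker is
 * already executing, in which case re-queueing from here would just land
 * the work back on the dying CPU. Only mark the domain as orphaned.
 */
static void offline_cpu(struct dom *d, bool cancel_ok, int new_cpu)
{
	if (cancel_ok)
		d->work_cpu = new_cpu;   /* safe to reschedule directly */
	else
		d->work_cpu = NR_CPUS;   /* worker must re-home itself */
}

/*
 * Worker epilogue: a worker that finds itself unbound scans the domain
 * list for the sentinel and rearms the work on a surviving CPU.
 */
static void worker_rehome(struct dom *d, int any_cpu)
{
	if (d->work_cpu == NR_CPUS) {
		d->work_cpu = any_cpu;
		d->rearmed = true;
	}
}

/* Exercise the cancel-failed handoff; returns the final work CPU. */
static int demo(void)
{
	struct dom d = { .work_cpu = 3, .rearmed = false };

	offline_cpu(&d, false, 5);   /* cancel failed: orphan the domain */
	worker_rehome(&d, 2);        /* worker picks a surviving CPU */
	return d.rearmed ? d.work_cpu : -1;
}
```

If the domain was freed instead (last CPU went away), no domain carries the
sentinel and the worker's scan finds nothing to do, matching the first
scenario described in the patch.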
Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
---
fs/resctrl/monitor.c | 55 +++++++++++++++++++++++++++++++++++++++++++
fs/resctrl/rdtgroup.c | 27 +++++++++++++++------
2 files changed, 75 insertions(+), 7 deletions(-)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9fd901c78dc6..02434d11e024 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -791,12 +791,38 @@ static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
*/
void cqm_handle_limbo(struct work_struct *work)
{
+ struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
struct rdt_l3_mon_domain *d;
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
+ /*
+ * Worker was blocked waiting for the CPU it was running on to go
+ * offline. Handle two scenarios:
+ * - Worker was running on the last CPU of a domain. The domain and
+ * thus the work_struct has been freed so do not attempt to obtain
+ * domain via container_of(). All remaining domains have limbo
+ * handlers so the loop will not find any domains needing a
+ * limbo handler. Just exit.
+ * - Worker was running on CPU that just went offline with other
+ * CPUs in domain still running and available to take over the
+ * worker. Offline handler could not schedule a new worker on
+ * another CPU in the domain but signaled that this needs to be
+ * done by setting cqm_work_cpu to nr_cpu_ids. Find the domain
+ * that needs a worker and schedule it after the normal CQM
+ * interval.
+ */
+ if (!is_percpu_thread()) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
+ if (d->cqm_work_cpu == nr_cpu_ids)
+ cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL,
+ RESCTRL_PICK_ANY_CPU);
+ }
+ goto out_unlock;
+ }
+
d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -808,6 +834,7 @@ void cqm_handle_limbo(struct work_struct *work)
delay);
}
+out_unlock:
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
}
@@ -852,6 +879,34 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+
+ /*
+ * Worker was blocked waiting for the CPU it was running on to go
+ * offline. Handle two scenarios:
+ * - Worker was running on the last CPU of a domain. The domain and
+ * thus the work_struct has been freed so do not attempt to obtain
+ * domain via container_of(). All remaining domains have overflow
+ * handlers so the loop will not find any domains needing an
+ * overflow handler. Just exit.
+ * - Worker was running on CPU that just went offline with other
+ * CPUs in domain still running and available to take over the
+ * worker. Offline handler could not schedule a new worker on
+ * another CPU in the domain but signaled that this needs to be
+ * done by setting mbm_work_cpu to nr_cpu_ids. Find the domain
+ * that needs a worker and schedule it to run after the normal
+ * MBM interval. This is completely safe on CPUs with wide MBM
+ * counters. Likely OK for old CPUs with narrow counters as the
+ * MBM_OVERFLOW_INTERVAL was picked conservatively.
+ */
+ if (!is_percpu_thread()) {
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
+ if (d->mbm_work_cpu == nr_cpu_ids)
+ mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL,
+ RESCTRL_PICK_ANY_CPU);
+ }
+ goto out_unlock;
+ }
+
d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 62e1e4c30f78..bab9afd5066e 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4343,8 +4343,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
goto out_unlock;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- if (resctrl_is_mbm_enabled())
- cancel_delayed_work(&d->mbm_over);
+
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
/*
* When a package is going down, forcefully
@@ -4355,7 +4354,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* package never comes back.
*/
__check_limbo(d, true);
- cancel_delayed_work(&d->cqm_limbo);
}
domain_destroy_l3_mon_state(d);
@@ -4536,13 +4534,28 @@ void resctrl_offline_cpu(unsigned int cpu)
d = get_mon_domain_from_cpu(cpu, l3);
if (d) {
if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
- cancel_delayed_work(&d->mbm_over);
- mbm_setup_overflow_handler(d, 0, cpu);
+ if (cancel_delayed_work(&d->mbm_over)) {
+ mbm_setup_overflow_handler(d, 0, cpu);
+ } else {
+ /*
+ * Unable to schedule work on new CPU if it
+ * is currently running since the re-schedule
+ * will just force new work to run on
+ * current CPU. Mark domain's worker as
+ * needing to be rescheduled to be handled
+ * by worker itself.
+ */
+ d->mbm_work_cpu = nr_cpu_ids;
+ }
}
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
- cancel_delayed_work(&d->cqm_limbo);
- cqm_setup_limbo_handler(d, 0, cpu);
+ if (cancel_delayed_work(&d->cqm_limbo)) {
+ cqm_setup_limbo_handler(d, 0, cpu);
+ } else {
+ /* Same as mbm_work_cpu case above */
+ d->cqm_work_cpu = nr_cpu_ids;
+ }
}
}
--
2.54.0
* Re: [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
@ 2026-05-08 21:36 ` Luck, Tony
2026-05-09 12:43 ` Chen, Yu C
0 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2026-05-08 21:36 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches
On Fri, May 08, 2026 at 11:21:41AM -0700, Tony Luck wrote:
> If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
> fails, the mon_data structures allocated by mon_get_kn_priv() are
> leaked. Add mon_put_kn_priv() to the out_mondata error path to free
> them.
>
> mon_get_kn_priv() and mon_put_kn_priv() moved so defined before used.
My bad, leaving this stale commit comment here. Code rearrangement
in patch 1 covered this.
> Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
> Assisted-by: GitHub_Copilot_CLI:claude-sonnet-4.6
Hmmm. I don't think I was assisted at all. Actively sabotaged with a
patch that looks plausible, but is actually wrong, seems more accurate.
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> fs/resctrl/rdtgroup.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index a6376a3fc4c3..0db1a92aefbe 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3067,8 +3067,10 @@ static int rdt_get_tree(struct fs_context *fc)
> out_psl:
> rdt_pseudo_lock_release();
> out_mondata:
> - if (resctrl_arch_mon_capable())
> + if (resctrl_arch_mon_capable()) {
> + mon_put_kn_priv();
Claude put this call here ...
> kernfs_remove(kn_mondata);
> + }
> out_mongrp:
> if (resctrl_arch_mon_capable()) {
But it really ought to be here.
> rdtgroup_unassign_cntrs(&rdtgroup_default);
> --
> 2.54.0
Claude is fired. Replacement patch below.
-Tony
From 0263035539f805f5d4bddcef8968b551354cb86d Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 7 May 2026 13:48:51 -0700
Subject: [PATCH] fs/resctrl: Free mon_data structures on rdt_get_tree()
failure
If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
fails, the mon_data structures allocated by mon_get_kn_priv() are
leaked. Add mon_put_kn_priv() to the out_mongrp error path to free
them.
Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/rdtgroup.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a6376a3fc4c3..506b40dc9430 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3071,6 +3071,7 @@ static int rdt_get_tree(struct fs_context *fc)
kernfs_remove(kn_mondata);
out_mongrp:
if (resctrl_arch_mon_capable()) {
+ mon_put_kn_priv();
rdtgroup_unassign_cntrs(&rdtgroup_default);
kernfs_remove(kn_mongrp);
}
--
2.54.0
* Re: [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
2026-05-08 21:36 ` Luck, Tony
@ 2026-05-09 12:43 ` Chen, Yu C
2026-05-11 3:15 ` Luck, Tony
0 siblings, 1 reply; 14+ messages in thread
From: Chen, Yu C @ 2026-05-09 12:43 UTC (permalink / raw)
To: Luck, Tony
Cc: Borislav Petkov, x86, linux-kernel, patches,
Maciej Wieczor-Retman, Reinette Chatre, Fenghua Yu, Babu Moger,
James Morse, Peter Newman, Drew Fustini, Dave Martin
On 5/9/2026 5:36 AM, Luck, Tony wrote:
> On Fri, May 08, 2026 at 11:21:41AM -0700, Tony Luck wrote:
>
> From 0263035539f805f5d4bddcef8968b551354cb86d Mon Sep 17 00:00:00 2001
> From: Tony Luck <tony.luck@intel.com>
> Date: Thu, 7 May 2026 13:48:51 -0700
> Subject: [PATCH] fs/resctrl: Free mon_data structures on rdt_get_tree()
> failure
>
> If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
> fails, the mon_data structures allocated by mon_get_kn_priv() are
> leaked. Add mon_put_kn_priv() to the out_mongrp error path to free
> them.
>
> Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
I did not find the above commit in the vanilla kernel. Should it be
2a6566038544 ("x86/resctrl: Expand the width of domid by replacing
mon_data_bits"), where mon_put_kn_priv() was introduced?
thanks,
Chenyu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount
2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
@ 2026-05-10 13:52 ` Chen, Yu C
2026-05-11 22:53 ` Reinette Chatre
1 sibling, 0 replies; 14+ messages in thread
From: Chen, Yu C @ 2026-05-10 13:52 UTC (permalink / raw)
To: Tony Luck
Cc: Borislav Petkov, x86, linux-kernel, patches, Reinette Chatre,
Maciej Wieczor-Retman, James Morse, Peter Newman, Babu Moger,
Dave Martin, Fenghua Yu, Drew Fustini
Hi Tony,
On 5/9/2026 2:21 AM, Tony Luck wrote:
> From: Reinette Chatre <reinette.chatre@intel.com>
>
> Sashiko noticed[1] a deadlock in the resctrl mount code.
>
> rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). If
> superblock setup fails inside kernfs_get_tree(), the VFS calls kill_sb on
> the same thread before the call returns. rdt_kill_sb() unconditionally
> attempts to acquire rdtgroup_mutex and deadlock occurs.
>
> Move the call to kernfs_get_tree() outside of locks.
>
> Add resctrl_unmount() helper to keep code consistent between the
> rdt_get_tree() failure path and a normal unmount.
>
> If kernfs_get_tree() fails and ctx->kfc.new_sb_created is set, then rdt_kill_sb()
> has already been called and no further cleanup is needed.
>
[ ... ]
> +
> static int rdt_get_tree(struct fs_context *fc)
> {
> struct rdt_fs_context *ctx = rdt_fc2context(fc);
> @@ -3043,10 +3066,6 @@ static int rdt_get_tree(struct fs_context *fc)
> if (ret)
> goto out_mondata;
>
> - ret = kernfs_get_tree(fc);
> - if (ret < 0)
> - goto out_psl;
> -
> if (resctrl_arch_alloc_capable())
> resctrl_arch_enable_alloc();
> if (resctrl_arch_mon_capable())
> @@ -3062,10 +3081,19 @@ static int rdt_get_tree(struct fs_context *fc)
> RESCTRL_PICK_ANY_CPU);
> }
>
> - goto out;
> + rdt_last_cmd_clear();
> + mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
> +
> + ret = kernfs_get_tree(fc);
> + /*
> + * resctrl can only be mounted once, new superblock only expected
> + * to be created once.
> + */
I gradually came to understand the purpose of new_sb_created, and how
it indicates whether the resctrl state should be released or has
already been released. Initially, I thought that if
ctx->kfc.new_sb_created is false, this could happen either due to a
failure to create a superblock or due to a reference to an existing
superblock. In the latter case, the code tears down all resctrl state
via resctrl_unmount() while still returning success. As a result, the
VFS might complete the mount with an invalid resctrl state. I later
realized this scenario cannot occur, as mounting resctrl twice is
impossible - kernfs_get_tree() will not return 0 in such cases. It
might be worthwhile adding a comment here to clarify the
rationale/scenario for future reference IMO : )
thanks,
Chenyu
> + if (!ctx->kfc.new_sb_created)
> + resctrl_unmount();
> + return ret;
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
2026-05-09 12:43 ` Chen, Yu C
@ 2026-05-11 3:15 ` Luck, Tony
2026-05-12 1:51 ` Chen, Yu C
0 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2026-05-11 3:15 UTC (permalink / raw)
To: Chen, Yu C
Cc: Borislav Petkov, x86, linux-kernel, patches,
Maciej Wieczor-Retman, Reinette Chatre, Fenghua Yu, Babu Moger,
James Morse, Peter Newman, Drew Fustini, Dave Martin
On Sat, May 09, 2026 at 08:43:36PM +0800, Chen, Yu C wrote:
> On 5/9/2026 5:36 AM, Luck, Tony wrote:
> > On Fri, May 08, 2026 at 11:21:41AM -0700, Tony Luck wrote:
> >
> > From 0263035539f805f5d4bddcef8968b551354cb86d Mon Sep 17 00:00:00 2001
> > From: Tony Luck <tony.luck@intel.com>
> > Date: Thu, 7 May 2026 13:48:51 -0700
> > Subject: [PATCH] fs/resctrl: Free mon_data structures on rdt_get_tree()
> > failure
> >
> > If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
> > fails, the mon_data structures allocated by mon_get_kn_priv() are
> > leaked. Add mon_put_kn_priv() to the out_mongrp error path to free
> > them.
> >
> > Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
>
> I did not find the above commit in the vanilla kernel. Should it be
> 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing
> mon_data_bits"), where mon_put_kn_priv() was introduced?
Thanks for the catch. Another strike against Claude. It didn't hallucinate
this commit for the fixes tag, that commit does appear in my local repo.
But in a dead-end branch I created to write the patch. I posted it here
in v3 of my telemetry series:
https://lore.kernel.org/all/20250407234032.241215-2-tony.luck@intel.com/
But James Morse picked it up in v11 of his "Move resctrl filesystem code
to fs/resctrl" series:
https://lore.kernel.org/all/20250513171547.15194-13-james.morse@arm.com/
where it was applied upstream in the 2a6566038544 commit you found.
I'm somewhat confused that Claude dug through my other branches instead
of just looking for ancestors of the current branch. Maybe I need an
explicit instruction to AI agents on how to find commits to use for
Fixes: tags?
Claude is now double-fired.
-Tony
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount
2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
2026-05-10 13:52 ` Chen, Yu C
@ 2026-05-11 22:53 ` Reinette Chatre
2026-05-12 7:28 ` Chen, Yu C
1 sibling, 1 reply; 14+ messages in thread
From: Reinette Chatre @ 2026-05-11 22:53 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches
Hi Tony,
On 5/8/26 11:21 AM, Tony Luck wrote:
> @@ -3173,26 +3200,8 @@ static int rdt_init_fs_context(struct fs_context *fc)
>
> static void rdt_kill_sb(struct super_block *sb)
> {
> - struct rdt_resource *r;
> -
> - cpus_read_lock();
> - mutex_lock(&rdtgroup_mutex);
> -
> - rdt_disable_ctx();
> -
> - /* Put everything back to default values. */
> - for_each_alloc_capable_rdt_resource(r)
> - resctrl_arch_reset_all_ctrls(r);
> -
> - resctrl_fs_teardown();
> - if (resctrl_arch_alloc_capable())
> - resctrl_arch_disable_alloc();
> - if (resctrl_arch_mon_capable())
> - resctrl_arch_disable_mon();
> - resctrl_mounted = false;
> + resctrl_unmount();
> kernfs_kill_sb(sb);
> - mutex_unlock(&rdtgroup_mutex);
> - cpus_read_unlock();
> }
>
> static struct file_system_type rdt_fs_type = {
sashiko reports [1] that this fix unmasks a use-after-free. This new issue looks to be
legitimate. I am not comfortable with sashiko's proposed fix since it flips the current
teardown ordering that is based on kernfs guidance per kernfs_kill_sb() function comments:
* This can be used directly for file_system_type->kill_sb(). If a kernfs
* user needs extra cleanup, it can implement its own kill_sb() and call
* this function at the end.
Something else to consider is how interference from resctrl_exit() may occur now
that its caller has been merged. What do you think of something like below instead? The
main idea is that it takes an extra reference to the root kn before calling
kernfs_get_tree(), which it releases upon return. I find such a symmetric solution,
relying on reference counts instead of code ordering, easier to reason about, and by
taking the extra reference with rdtgroup_mutex held it can handle interference from
resctrl_exit(). What do you think?
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 1ec7ff0389d2..4c4f98ebf4a9 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3035,6 +3035,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
+ struct kernfs_node *rdt_root_kn;
struct rdt_l3_mon_domain *dom;
struct rdt_resource *r;
int ret;
@@ -3119,6 +3120,19 @@ static int rdt_get_tree(struct fs_context *fc)
RESCTRL_PICK_ANY_CPU);
}
+ rdt_root_kn = rdtgroup_default.kn;
+ /*
+ * Keep root kn alive across kernfs_get_tree(). If kernfs_get_tree()
+ * fails after superblock is created (ctx->kfc.new_sb_created is true)
+ * kernfs_get_tree() will call rdt_kill_sb() itself.
+ * rdt_kill_sb()->resctrl_unmount()->kernfs_destroy_root() is followed
+ * by kernfs_kill_sb() that still needs to dereference root kn
+ * after kernfs_destroy_root().
+ *
+ * Obtain reference with locks held to protect against interference
+ * from resctrl_exit().
+ */
+ kernfs_get(rdt_root_kn);
rdt_last_cmd_clear();
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
@@ -3130,6 +3144,7 @@ static int rdt_get_tree(struct fs_context *fc)
*/
if (!ctx->kfc.new_sb_created)
resctrl_unmount();
+ kernfs_put(rdt_root_kn);
return ret;
out_mondata:
Reinette
[1] https://sashiko.dev/#/patchset/20260508182143.14592-1-tony.luck%40intel.com?part=3
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline
2026-05-08 18:21 ` [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline Tony Luck
@ 2026-05-11 23:06 ` Reinette Chatre
0 siblings, 0 replies; 14+ messages in thread
From: Reinette Chatre @ 2026-05-11 23:06 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: Borislav Petkov, x86, linux-kernel, patches
Hi Tony,
On 5/8/26 11:21 AM, Tony Luck wrote:
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 9fd901c78dc6..02434d11e024 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -791,12 +791,38 @@ static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
> */
> void cqm_handle_limbo(struct work_struct *work)
> {
> + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> struct rdt_l3_mon_domain *d;
>
> cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> + /*
> + * Worker was blocked waiting for the CPU it was running on to go
> + * offline. Handle two scenarios:
> + * - Worker was running on the last CPU of a domain. The domain and
> + * thus the work_struct has been freed so do not attempt to obtain
> + * domain via container_of(). All remaining domains have limbo
> + * handlers so the loop will not find any domains needing a
> + * limbo handler. Just exit.
> + * - Worker was running on CPU that just went offline with other
> + * CPUs in domain still running and available to take over the
> + * worker. Offline handler could not schedule a new worker on
> + * another CPU in the domain but signaled that this needs to be
> + * done by setting cqm_work_cpu to nr_cpu_ids. Find the domain
> + * that needs a worker and schedule it after the normal CQM
> + * interval.
> + */
> + if (!is_percpu_thread()) {
> + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> + if (d->cqm_work_cpu == nr_cpu_ids)
> + cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL,
> + RESCTRL_PICK_ANY_CPU);
> + }
> + goto out_unlock;
> + }
> +
> d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
>
The issue reported by sashiko [1] is not clear to me. The claim is that if above worker
is running on last CPU of a domain and is blocked at cpus_read_lock() at the time
the CPU it is running on is rapidly offlined and then onlined, then when the
worker can run it will find is_percpu_thread() to be true but the domain structure
will be freed.
I am not familiar with the CPU hotplug locking but from what I can tell, in this
scenario, the cpus_write_lock() in _cpu_up() will block since there is a pending reader
and the worker will be able to run before the CPU online work is done. The scenario presented
thus seems to be defeated by percpu-rwsem semantics. What do you think of the scenario
presented in [1]?
Reinette
[1] https://sashiko.dev/#/patchset/20260508182143.14592-1-tony.luck%40intel.com?part=4
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
2026-05-11 3:15 ` Luck, Tony
@ 2026-05-12 1:51 ` Chen, Yu C
0 siblings, 0 replies; 14+ messages in thread
From: Chen, Yu C @ 2026-05-12 1:51 UTC (permalink / raw)
To: Luck, Tony
Cc: Borislav Petkov, x86, linux-kernel, patches,
Maciej Wieczor-Retman, Reinette Chatre, Fenghua Yu, Babu Moger,
James Morse, Peter Newman, Drew Fustini, Dave Martin
On 5/11/2026 11:15 AM, Luck, Tony wrote:
> On Sat, May 09, 2026 at 08:43:36PM +0800, Chen, Yu C wrote:
>> On 5/9/2026 5:36 AM, Luck, Tony wrote:
>>> On Fri, May 08, 2026 at 11:21:41AM -0700, Tony Luck wrote:
>>>
>>> From 0263035539f805f5d4bddcef8968b551354cb86d Mon Sep 17 00:00:00 2001
>>> From: Tony Luck <tony.luck@intel.com>
>>> Date: Thu, 7 May 2026 13:48:51 -0700
>>> Subject: [PATCH] fs/resctrl: Free mon_data structures on rdt_get_tree()
>>> failure
>>>
>>> If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
>>> fails, the mon_data structures allocated by mon_get_kn_priv() are
>>> leaked. Add mon_put_kn_priv() to the out_mongrp error path to free
>>> them.
>>>
>>> Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
>>
>> I did not find above commit in vanilla kernel, should it be 2a6566038544
>> ("x86/resctrl:
>> Expand the width of domid by replacing mon_data_bits") where
>> mon_put_kn_priv() was introduced?
>
> Thanks for the catch. Another strike against Claude. It didn't hallucinate
> this commit for the fixes tag, that commit does appear in my local repo.
> But in a dead-end branch I created to write the patch. I posted it here
> in v3 of my telemetry series:
>
> https://lore.kernel.org/all/20250407234032.241215-2-tony.luck@intel.com/
>
> But James Morse picked it up in v11 of his "Move resctrl filessytem code
> to fs/resctrl" series:
>
> https://lore.kernel.org/all/20250513171547.15194-13-james.morse@arm.com/
>
> where it was applied upstream in the 2a6566038544 commit you found.
>
> I'm somewhat confused that Claude dug through my other branches instead
> of just looking for ancestors of the current branch. Maybe I need an
> explicit instruction to AI agents on how to find commits to use for
> Fixes: tags?
I guess so, Opus 4.6 suggested the following prompt:
"When generating Fixes: tags, only consider commits reachable from the
current branch (git log). Do not search other local branches."
thanks,
Chenyu
>
> Claude is now double-fired.
>
> -Tony
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount
2026-05-11 22:53 ` Reinette Chatre
@ 2026-05-12 7:28 ` Chen, Yu C
2026-05-12 14:34 ` Reinette Chatre
0 siblings, 1 reply; 14+ messages in thread
From: Chen, Yu C @ 2026-05-12 7:28 UTC (permalink / raw)
To: Reinette Chatre
Cc: Borislav Petkov, x86, linux-kernel, patches,
Maciej Wieczor-Retman, Fenghua Yu, Tony Luck, James Morse,
Drew Fustini, Babu Moger, Peter Newman, Dave Martin
On 5/12/2026 6:53 AM, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/8/26 11:21 AM, Tony Luck wrote:
> + * Obtain reference with locks held to protect against interference
> + * from resctrl_exit().
> + */
> + kernfs_get(rdt_root_kn);
[ ... ]
> @@ -3130,6 +3144,7 @@ static int rdt_get_tree(struct fs_context *fc)
> */
> if (!ctx->kfc.new_sb_created)
> resctrl_unmount();
> + kernfs_put(rdt_root_kn);
I wonder if the above should be protected by
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
like kernfs_get() is?
thanks,
Chenyu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount
2026-05-12 7:28 ` Chen, Yu C
@ 2026-05-12 14:34 ` Reinette Chatre
0 siblings, 0 replies; 14+ messages in thread
From: Reinette Chatre @ 2026-05-12 14:34 UTC (permalink / raw)
To: Chen, Yu C
Cc: Borislav Petkov, x86, linux-kernel, patches,
Maciej Wieczor-Retman, Fenghua Yu, Tony Luck, James Morse,
Drew Fustini, Babu Moger, Peter Newman, Dave Martin
Hi Chenyu,
On 5/12/26 12:28 AM, Chen, Yu C wrote:
> On 5/12/2026 6:53 AM, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/8/26 11:21 AM, Tony Luck wrote:
>
>> + * Obtain reference with locks held to protect against interference
>> + * from resctrl_exit().
>> + */
>> + kernfs_get(rdt_root_kn);
>
> [ ... ]
>
>> @@ -3130,6 +3144,7 @@ static int rdt_get_tree(struct fs_context *fc)
>> */
>> if (!ctx->kfc.new_sb_created)
>> resctrl_unmount();
>> + kernfs_put(rdt_root_kn);
>
> I wonder if above should be protected against
> cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
> like kernfs_get()?
It is not obvious to me what this protection would be needed for.
Do you have a troublesome scenario in mind?
rdt_root_kn is a local copy of rdtgroup_default.kn. The latter is indeed
protected by the mutex. The reason why the kernfs_get() is protected
by the mutex is to ensure what rdt_root_kn points to, rdtgroup_default.kn, remains
accessible after the mutex is dropped. Nothing else modifies rdt_root_kn. I
understand the appeal of symmetry but it is not clear to me what the extra
locking is needed for here?
Could it perhaps make this flow easier to understand if the kernfs_get() is
of the mutex-protected rdtgroup_default.kn while the kernfs_put() is
of the local backup copy? For example:
/* Ensure root kn remains accessible after mutex is unlocked */
kernfs_get(rdtgroup_default.kn);
/*
* Make backup of rdtgroup_default.kn just in case one of the
* following flows (that sets rdtgroup_default.kn to NULL) run after
* the mutex is unlocked:
* resctrl_exit()->resctrl_fs_teardown()->rdtgroup_destroy_root()
* kernfs_get_tree()->deactivate_locked_super()->rdt_kill_sb()->resctrl_unmount()->resctrl_fs_teardown()->rdtgroup_destroy_root()
* These flows would not actually result in rdtgroup_default.kn
* being removed thanks to the additional reference.
*/
rdt_root_kn = rdtgroup_default.kn;
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
ret = kernfs_get_tree(fc);
if (!ctx->kfc.new_sb_created)
resctrl_unmount();
kernfs_put(rdt_root_kn);
Reinette
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-05-12 14:34 UTC | newest]
Thread overview: 14+ messages
2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
2026-05-08 21:36 ` Luck, Tony
2026-05-09 12:43 ` Chen, Yu C
2026-05-11 3:15 ` Luck, Tony
2026-05-12 1:51 ` Chen, Yu C
2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
2026-05-10 13:52 ` Chen, Yu C
2026-05-11 22:53 ` Reinette Chatre
2026-05-12 7:28 ` Chen, Yu C
2026-05-12 14:34 ` Reinette Chatre
2026-05-08 18:21 ` [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline Tony Luck
2026-05-11 23:06 ` Reinette Chatre