The Linux Kernel Mailing List
* [PATCH 0/4] fs/resctrl: Fix three long-standing issues
@ 2026-05-08 18:21 Tony Luck
  2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
  Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck

Sashiko reported[0] a deadlock during mount, and a use-after-free when an
L3 domain is removed during CPU offline. Reinette found a memory leak
in the mount error path while refactoring code for a solution to the
mount hang.

Patch 1 just reorders some code so that the fixes can be applied without
adding forward declarations of functions.

Patch 2 fixes the memory leak found by Reinette.

Patch 3 fixes the mount deadlock.[1]

Patch 4 fixes issues with CPU offline.[2]

N.B. Reinette did all the work for patches 3 & 4, so I listed her
as author and me as co-developer. Those patches need her sign-off.

[0] Sashiko report:
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com

[1] This is v3 of the deadlock fix. Previous version is here:
Link: https://lore.kernel.org/all/20260504220149.157753-1-tony.luck@intel.com/

[2] This is v2 of the CPU offline fix. Previous version is here:
Link: https://lore.kernel.org/all/20260501213611.25600-1-tony.luck@intel.com/


Reinette Chatre (2):
  fs/resctrl: Fix deadlock for errors during mount
  fs/resctrl: Fix issues with worker threads when CPUs are taken offline

Tony Luck (2):
  fs/resctrl: Move functions to avoid forward references in subsequent
    fixes
  fs/resctrl: Free mon_data structures on rdt_get_tree() failure

 fs/resctrl/monitor.c  |  55 ++++++
 fs/resctrl/rdtgroup.c | 440 ++++++++++++++++++++++--------------------
 2 files changed, 287 insertions(+), 208 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes
  2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
  2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
  Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck

No functional change. Move some functions ahead of rdt_get_tree() so that
subsequent fixes do not need forward declarations.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 fs/resctrl/rdtgroup.c | 376 +++++++++++++++++++++---------------------
 1 file changed, 188 insertions(+), 188 deletions(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 5dfdaa6f9d8f..a6376a3fc4c3 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2782,6 +2782,194 @@ static void schemata_list_destroy(void)
 	}
 }
 
+/*
+ * Move tasks from one to the other group. If @from is NULL, then all tasks
+ * in the systems are moved unconditionally (used for teardown).
+ *
+ * If @mask is not NULL the cpus on which moved tasks are running are set
+ * in that mask so the update smp function call is restricted to affected
+ * cpus.
+ */
+static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
+				 struct cpumask *mask)
+{
+	struct task_struct *p, *t;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(p, t) {
+		if (!from || is_closid_match(t, from) ||
+		    is_rmid_match(t, from)) {
+			resctrl_arch_set_closid_rmid(t, to->closid,
+						     to->mon.rmid);
+
+			/*
+			 * Order the closid/rmid stores above before the loads
+			 * in task_curr(). This pairs with the full barrier
+			 * between the rq->curr update and
+			 * resctrl_arch_sched_in() during context switch.
+			 */
+			smp_mb();
+
+			/*
+			 * If the task is on a CPU, set the CPU in the mask.
+			 * The detection is inaccurate as tasks might move or
+			 * schedule before the smp function call takes place.
+			 * In such a case the function call is pointless, but
+			 * there is no other side effect.
+			 */
+			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
+				cpumask_set_cpu(task_cpu(t), mask);
+		}
+	}
+	read_unlock(&tasklist_lock);
+}
+
+static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
+{
+	struct rdtgroup *sentry, *stmp;
+	struct list_head *head;
+
+	head = &rdtgrp->mon.crdtgrp_list;
+	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
+		rdtgroup_unassign_cntrs(sentry);
+		free_rmid(sentry->closid, sentry->mon.rmid);
+		list_del(&sentry->mon.crdtgrp_list);
+
+		if (atomic_read(&sentry->waitcount) != 0)
+			sentry->flags = RDT_DELETED;
+		else
+			rdtgroup_remove(sentry);
+	}
+}
+
+/*
+ * Forcibly remove all of subdirectories under root.
+ */
+static void rmdir_all_sub(void)
+{
+	struct rdtgroup *rdtgrp, *tmp;
+
+	/* Move all tasks to the default resource group */
+	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
+
+	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
+		/* Free any child rmids */
+		free_all_child_rdtgrp(rdtgrp);
+
+		/* Remove each rdtgroup other than root */
+		if (rdtgrp == &rdtgroup_default)
+			continue;
+
+		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
+		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
+			rdtgroup_pseudo_lock_remove(rdtgrp);
+
+		/*
+		 * Give any CPUs back to the default group. We cannot copy
+		 * cpu_online_mask because a CPU might have executed the
+		 * offline callback already, but is still marked online.
+		 */
+		cpumask_or(&rdtgroup_default.cpu_mask,
+			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
+
+		rdtgroup_unassign_cntrs(rdtgrp);
+
+		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+
+		kernfs_remove(rdtgrp->kn);
+		list_del(&rdtgrp->rdtgroup_list);
+
+		if (atomic_read(&rdtgrp->waitcount) != 0)
+			rdtgrp->flags = RDT_DELETED;
+		else
+			rdtgroup_remove(rdtgrp);
+	}
+	/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
+	update_closid_rmid(cpu_online_mask, &rdtgroup_default);
+
+	kernfs_remove(kn_info);
+	kernfs_remove(kn_mongrp);
+	kernfs_remove(kn_mondata);
+}
+
+/**
+ * mon_get_kn_priv() - Get the mon_data priv data for this event.
+ *
+ * The same values are used across the mon_data directories of all control and
+ * monitor groups for the same event in the same domain. Keep a list of
+ * allocated structures and re-use an existing one with the same values for
+ * @rid, @domid, etc.
+ *
+ * @rid:    The resource id for the event file being created.
+ * @domid:  The domain id for the event file being created.
+ * @mevt:   The type of event file being created.
+ * @do_sum: Whether SNC summing monitors are being created. Only set
+ *	    when @rid == RDT_RESOURCE_L3.
+ *
+ * Return: Pointer to mon_data private data of the event, NULL on failure.
+ */
+static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
+					struct mon_evt *mevt,
+					bool do_sum)
+{
+	struct mon_data *priv;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
+		if (priv->rid == rid && priv->domid == domid &&
+		    priv->sum == do_sum && priv->evt == mevt)
+			return priv;
+	}
+
+	priv = kzalloc_obj(*priv);
+	if (!priv)
+		return NULL;
+
+	priv->rid = rid;
+	priv->domid = domid;
+	priv->sum = do_sum;
+	priv->evt = mevt;
+	list_add_tail(&priv->list, &mon_data_kn_priv_list);
+
+	return priv;
+}
+
+/**
+ * mon_put_kn_priv() - Free all allocated mon_data structures.
+ *
+ * Called when resctrl file system is unmounted.
+ */
+static void mon_put_kn_priv(void)
+{
+	struct mon_data *priv, *tmp;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
+		list_del(&priv->list);
+		kfree(priv);
+	}
+}
+
+static void resctrl_fs_teardown(void)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	/* Cleared by rdtgroup_destroy_root() */
+	if (!rdtgroup_default.kn)
+		return;
+
+	rmdir_all_sub();
+	rdtgroup_unassign_cntrs(&rdtgroup_default);
+	mon_put_kn_priv();
+	rdt_pseudo_lock_release();
+	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
+	closid_exit();
+	schemata_list_destroy();
+	rdtgroup_destroy_root();
+}
+
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -2981,194 +3169,6 @@ static int rdt_init_fs_context(struct fs_context *fc)
 	return 0;
 }
 
-/*
- * Move tasks from one to the other group. If @from is NULL, then all tasks
- * in the systems are moved unconditionally (used for teardown).
- *
- * If @mask is not NULL the cpus on which moved tasks are running are set
- * in that mask so the update smp function call is restricted to affected
- * cpus.
- */
-static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
-				 struct cpumask *mask)
-{
-	struct task_struct *p, *t;
-
-	read_lock(&tasklist_lock);
-	for_each_process_thread(p, t) {
-		if (!from || is_closid_match(t, from) ||
-		    is_rmid_match(t, from)) {
-			resctrl_arch_set_closid_rmid(t, to->closid,
-						     to->mon.rmid);
-
-			/*
-			 * Order the closid/rmid stores above before the loads
-			 * in task_curr(). This pairs with the full barrier
-			 * between the rq->curr update and
-			 * resctrl_arch_sched_in() during context switch.
-			 */
-			smp_mb();
-
-			/*
-			 * If the task is on a CPU, set the CPU in the mask.
-			 * The detection is inaccurate as tasks might move or
-			 * schedule before the smp function call takes place.
-			 * In such a case the function call is pointless, but
-			 * there is no other side effect.
-			 */
-			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
-				cpumask_set_cpu(task_cpu(t), mask);
-		}
-	}
-	read_unlock(&tasklist_lock);
-}
-
-static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
-{
-	struct rdtgroup *sentry, *stmp;
-	struct list_head *head;
-
-	head = &rdtgrp->mon.crdtgrp_list;
-	list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
-		rdtgroup_unassign_cntrs(sentry);
-		free_rmid(sentry->closid, sentry->mon.rmid);
-		list_del(&sentry->mon.crdtgrp_list);
-
-		if (atomic_read(&sentry->waitcount) != 0)
-			sentry->flags = RDT_DELETED;
-		else
-			rdtgroup_remove(sentry);
-	}
-}
-
-/*
- * Forcibly remove all of subdirectories under root.
- */
-static void rmdir_all_sub(void)
-{
-	struct rdtgroup *rdtgrp, *tmp;
-
-	/* Move all tasks to the default resource group */
-	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
-
-	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
-		/* Free any child rmids */
-		free_all_child_rdtgrp(rdtgrp);
-
-		/* Remove each rdtgroup other than root */
-		if (rdtgrp == &rdtgroup_default)
-			continue;
-
-		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
-		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
-			rdtgroup_pseudo_lock_remove(rdtgrp);
-
-		/*
-		 * Give any CPUs back to the default group. We cannot copy
-		 * cpu_online_mask because a CPU might have executed the
-		 * offline callback already, but is still marked online.
-		 */
-		cpumask_or(&rdtgroup_default.cpu_mask,
-			   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
-
-		rdtgroup_unassign_cntrs(rdtgrp);
-
-		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-
-		kernfs_remove(rdtgrp->kn);
-		list_del(&rdtgrp->rdtgroup_list);
-
-		if (atomic_read(&rdtgrp->waitcount) != 0)
-			rdtgrp->flags = RDT_DELETED;
-		else
-			rdtgroup_remove(rdtgrp);
-	}
-	/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
-	update_closid_rmid(cpu_online_mask, &rdtgroup_default);
-
-	kernfs_remove(kn_info);
-	kernfs_remove(kn_mongrp);
-	kernfs_remove(kn_mondata);
-}
-
-/**
- * mon_get_kn_priv() - Get the mon_data priv data for this event.
- *
- * The same values are used across the mon_data directories of all control and
- * monitor groups for the same event in the same domain. Keep a list of
- * allocated structures and re-use an existing one with the same values for
- * @rid, @domid, etc.
- *
- * @rid:    The resource id for the event file being created.
- * @domid:  The domain id for the event file being created.
- * @mevt:   The type of event file being created.
- * @do_sum: Whether SNC summing monitors are being created. Only set
- *	    when @rid == RDT_RESOURCE_L3.
- *
- * Return: Pointer to mon_data private data of the event, NULL on failure.
- */
-static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
-					struct mon_evt *mevt,
-					bool do_sum)
-{
-	struct mon_data *priv;
-
-	lockdep_assert_held(&rdtgroup_mutex);
-
-	list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
-		if (priv->rid == rid && priv->domid == domid &&
-		    priv->sum == do_sum && priv->evt == mevt)
-			return priv;
-	}
-
-	priv = kzalloc_obj(*priv);
-	if (!priv)
-		return NULL;
-
-	priv->rid = rid;
-	priv->domid = domid;
-	priv->sum = do_sum;
-	priv->evt = mevt;
-	list_add_tail(&priv->list, &mon_data_kn_priv_list);
-
-	return priv;
-}
-
-/**
- * mon_put_kn_priv() - Free all allocated mon_data structures.
- *
- * Called when resctrl file system is unmounted.
- */
-static void mon_put_kn_priv(void)
-{
-	struct mon_data *priv, *tmp;
-
-	lockdep_assert_held(&rdtgroup_mutex);
-
-	list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
-		list_del(&priv->list);
-		kfree(priv);
-	}
-}
-
-static void resctrl_fs_teardown(void)
-{
-	lockdep_assert_held(&rdtgroup_mutex);
-
-	/* Cleared by rdtgroup_destroy_root() */
-	if (!rdtgroup_default.kn)
-		return;
-
-	rmdir_all_sub();
-	rdtgroup_unassign_cntrs(&rdtgroup_default);
-	mon_put_kn_priv();
-	rdt_pseudo_lock_release();
-	rdtgroup_default.mode = RDT_MODE_SHAREABLE;
-	closid_exit();
-	schemata_list_destroy();
-	rdtgroup_destroy_root();
-}
-
 static void rdt_kill_sb(struct super_block *sb)
 {
 	struct rdt_resource *r;
-- 
2.54.0



* [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
  2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
  2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
  2026-05-08 21:36   ` Luck, Tony
  2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
  2026-05-08 18:21 ` [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline Tony Luck
  3 siblings, 1 reply; 9+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
  Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck

If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
fails, the mon_data structures allocated by mon_get_kn_priv() are
leaked. Add mon_put_kn_priv() to the out_mondata error path to free
them.

mon_get_kn_priv() and mon_put_kn_priv() moved so defined before used.

Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Assisted-by: GitHub_Copilot_CLI:claude-sonnet-4.6
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 fs/resctrl/rdtgroup.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a6376a3fc4c3..0db1a92aefbe 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3067,8 +3067,10 @@ static int rdt_get_tree(struct fs_context *fc)
 out_psl:
 	rdt_pseudo_lock_release();
 out_mondata:
-	if (resctrl_arch_mon_capable())
+	if (resctrl_arch_mon_capable()) {
+		mon_put_kn_priv();
 		kernfs_remove(kn_mondata);
+	}
 out_mongrp:
 	if (resctrl_arch_mon_capable()) {
 		rdtgroup_unassign_cntrs(&rdtgroup_default);
-- 
2.54.0



* [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount
  2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
  2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
  2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
  2026-05-10 13:52   ` Chen, Yu C
  2026-05-08 18:21 ` [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline Tony Luck
  3 siblings, 1 reply; 9+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
  Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck

From: Reinette Chatre <reinette.chatre@intel.com>

Sashiko noticed[1] a deadlock in the resctrl mount code.

rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). If
superblock setup fails inside kernfs_get_tree(), the VFS calls kill_sb on
the same thread before the call returns.  rdt_kill_sb() unconditionally
attempts to acquire rdtgroup_mutex and deadlock occurs.

Move the call to kernfs_get_tree() outside of locks.

Add resctrl_unmount() helper to keep code consistent between the
rdt_get_tree() failure path and a normal unmount.

If kernfs_get_tree() fails and ctx->kfc.new_sb_created is set, then rdt_kill_sb()
has already been called and no further cleanup is needed.

Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support")
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
---
 fs/resctrl/rdtgroup.c | 63 ++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 0db1a92aefbe..62e1e4c30f78 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2970,6 +2970,29 @@ static void resctrl_fs_teardown(void)
 	rdtgroup_destroy_root();
 }
 
+static void resctrl_unmount(void)
+{
+	struct rdt_resource *r;
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+
+	rdt_disable_ctx();
+
+	/* Put everything back to default values. */
+	for_each_alloc_capable_rdt_resource(r)
+		resctrl_arch_reset_all_ctrls(r);
+
+	resctrl_fs_teardown();
+	if (resctrl_arch_alloc_capable())
+		resctrl_arch_disable_alloc();
+	if (resctrl_arch_mon_capable())
+		resctrl_arch_disable_mon();
+	resctrl_mounted = false;
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+}
+
 static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -3043,10 +3066,6 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (ret)
 		goto out_mondata;
 
-	ret = kernfs_get_tree(fc);
-	if (ret < 0)
-		goto out_psl;
-
 	if (resctrl_arch_alloc_capable())
 		resctrl_arch_enable_alloc();
 	if (resctrl_arch_mon_capable())
@@ -3062,10 +3081,19 @@ static int rdt_get_tree(struct fs_context *fc)
 						   RESCTRL_PICK_ANY_CPU);
 	}
 
-	goto out;
+	rdt_last_cmd_clear();
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+
+	ret = kernfs_get_tree(fc);
+	/*
+	 * resctrl can only be mounted once, new superblock only expected
+	 * to be created once.
+	 */
+	if (!ctx->kfc.new_sb_created)
+		resctrl_unmount();
+	return ret;
 
-out_psl:
-	rdt_pseudo_lock_release();
 out_mondata:
 	if (resctrl_arch_mon_capable()) {
 		mon_put_kn_priv();
@@ -3086,7 +3114,6 @@ static int rdt_get_tree(struct fs_context *fc)
 out_root:
 	rdtgroup_destroy_root();
 out:
-	rdt_last_cmd_clear();
 	mutex_unlock(&rdtgroup_mutex);
 	cpus_read_unlock();
 	return ret;
@@ -3173,26 +3200,8 @@ static int rdt_init_fs_context(struct fs_context *fc)
 
 static void rdt_kill_sb(struct super_block *sb)
 {
-	struct rdt_resource *r;
-
-	cpus_read_lock();
-	mutex_lock(&rdtgroup_mutex);
-
-	rdt_disable_ctx();
-
-	/* Put everything back to default values. */
-	for_each_alloc_capable_rdt_resource(r)
-		resctrl_arch_reset_all_ctrls(r);
-
-	resctrl_fs_teardown();
-	if (resctrl_arch_alloc_capable())
-		resctrl_arch_disable_alloc();
-	if (resctrl_arch_mon_capable())
-		resctrl_arch_disable_mon();
-	resctrl_mounted = false;
+	resctrl_unmount();
 	kernfs_kill_sb(sb);
-	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 }
 
 static struct file_system_type rdt_fs_type = {
-- 
2.54.0



* [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline
  2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
                   ` (2 preceding siblings ...)
  2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
@ 2026-05-08 18:21 ` Tony Luck
  3 siblings, 0 replies; 9+ messages in thread
From: Tony Luck @ 2026-05-08 18:21 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
  Cc: Borislav Petkov, x86, linux-kernel, patches, Tony Luck

From: Reinette Chatre <reinette.chatre@intel.com>

Sashiko noticed[1] a use-after-free in the resctrl worker thread code
where the rdt_l3_mon_domain structure was freed while the worker was blocked
waiting for locks.

The root issue is that cancel_delayed_work() cannot cancel, and does not
wait for, a worker that is already executing. This results in the race that
Sashiko noticed, but also causes problems when the CPU that has been chosen
to service the worker thread is taken offline.

Note that worker threads are allowed to delete their own work_struct
(see the comment in kernel/workqueue.c:process_one_work()) so there are no
problems on the return path from the worker in the case where the
work_struct was deleted by other code while the worker was executing.

Indicate failure of the cancel_delayed_work() calls in resctrl_offline_cpu()
by setting d->mbm_work_cpu or d->cqm_work_cpu to nr_cpu_ids. Make the worker
threads check whether they are still bound to the expected CPU. If not,
search the L3 domain list for any domain(s) with the work CPU set to
nr_cpu_ids. In the case where the last CPU was removed from a domain, the
domain has been removed from the list and there is nothing to do. If the
domain still exists, restart the worker on any of the remaining CPUs.

Remove redundant cancel_delayed_work() calls from resctrl_offline_mon_domain().

Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
---
 fs/resctrl/monitor.c  | 55 +++++++++++++++++++++++++++++++++++++++++++
 fs/resctrl/rdtgroup.c | 27 +++++++++++++++------
 2 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9fd901c78dc6..02434d11e024 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -791,12 +791,38 @@ static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
  */
 void cqm_handle_limbo(struct work_struct *work)
 {
+	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	struct rdt_l3_mon_domain *d;
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
+	/*
+	 * Worker was blocked waiting for the CPU it was running on to go
+	 * offline. Handle two scenarios:
+	 * - Worker was running on the last CPU of a domain. The domain and
+	 *   thus the work_struct has been freed so do not attempt to obtain
+	 *   domain via container_of(). All remaining domains have limbo
+	 *   handlers so the loop will not find any domains needing a
+	 *   limbo handler. Just exit.
+	 * - Worker was running on CPU that just went offline with other
+	 *   CPUs in domain still running and available to take over the
+	 *   worker. Offline handler could not schedule a new worker on
+	 *   another CPU in the domain but signaled that this needs to be
+	 *   done by setting cqm_work_cpu to nr_cpu_ids. Find the domain
+	 *   that needs a worker and schedule it after the normal CQM
+	 *   interval.
+	 */
+	if (!is_percpu_thread()) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+			if (d->cqm_work_cpu == nr_cpu_ids)
+				cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL,
+							RESCTRL_PICK_ANY_CPU);
+		}
+		goto out_unlock;
+	}
+
 	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
@@ -808,6 +834,7 @@ void cqm_handle_limbo(struct work_struct *work)
 					 delay);
 	}
 
+out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
 	cpus_read_unlock();
 }
@@ -852,6 +879,34 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+
+	/*
+	 * Worker was blocked waiting for the CPU it was running on to go
+	 * offline. Handle two scenarios:
+	 * - Worker was running on the last CPU of a domain. The domain and
+	 *   thus the work_struct has been freed so do not attempt to obtain
+	 *   domain via container_of(). All remaining domains have overflow
+	 *   handlers so the loop will not find any domains needing an
+	 *   overflow handler. Just exit.
+	 * - Worker was running on CPU that just went offline with other
+	 *   CPUs in domain still running and available to take over the
+	 *   worker. Offline handler could not schedule a new worker on
+	 *   another CPU in the domain but signaled that this needs to be
+	 *   done by setting mbm_work_cpu to nr_cpu_ids. Find the domain
+	 *   that needs a worker and schedule it to run after the normal
+	 *   MBM interval. This is completely safe on CPUs with wide MBM
+	 *   counters. Likely OK for old CPUs with narrow counters as the
+	 *   MBM_OVERFLOW_INTERVAL was picked conservatively.
+	 */
+	if (!is_percpu_thread()) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+			if (d->mbm_work_cpu == nr_cpu_ids)
+				mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL,
+							   RESCTRL_PICK_ANY_CPU);
+		}
+		goto out_unlock;
+	}
+
 	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 62e1e4c30f78..bab9afd5066e 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4343,8 +4343,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		goto out_unlock;
 
 	d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
-	if (resctrl_is_mbm_enabled())
-		cancel_delayed_work(&d->mbm_over);
+
 	if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
 		/*
 		 * When a package is going down, forcefully
@@ -4355,7 +4354,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		 * package never comes back.
 		 */
 		__check_limbo(d, true);
-		cancel_delayed_work(&d->cqm_limbo);
 	}
 
 	domain_destroy_l3_mon_state(d);
@@ -4536,13 +4534,28 @@ void resctrl_offline_cpu(unsigned int cpu)
 	d = get_mon_domain_from_cpu(cpu, l3);
 	if (d) {
 		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
-			cancel_delayed_work(&d->mbm_over);
-			mbm_setup_overflow_handler(d, 0, cpu);
+			if (cancel_delayed_work(&d->mbm_over)) {
+				mbm_setup_overflow_handler(d, 0, cpu);
+			} else {
+				/*
+				 * Unable to schedule work on new CPU if it
+				 * is currently running since the re-schedule
+				 * will just force new work to run on
+				 * current CPU. Mark domain's worker as
+				 * needing to be rescheduled to be handled
+				 * by worker itself.
+				 */
+				d->mbm_work_cpu = nr_cpu_ids;
+			}
 		}
 		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
 		    cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
-			cancel_delayed_work(&d->cqm_limbo);
-			cqm_setup_limbo_handler(d, 0, cpu);
+			if (cancel_delayed_work(&d->cqm_limbo)) {
+				cqm_setup_limbo_handler(d, 0, cpu);
+			} else {
+				/* Same as mbm_work_cpu case above */
+				d->cqm_work_cpu = nr_cpu_ids;
+			}
 		}
 	}
 
-- 
2.54.0



* Re: [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
  2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
@ 2026-05-08 21:36   ` Luck, Tony
  2026-05-09 12:43     ` Chen, Yu C
  0 siblings, 1 reply; 9+ messages in thread
From: Luck, Tony @ 2026-05-08 21:36 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
  Cc: Borislav Petkov, x86, linux-kernel, patches

On Fri, May 08, 2026 at 11:21:41AM -0700, Tony Luck wrote:
> If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
> fails, the mon_data structures allocated by mon_get_kn_priv() are
> leaked. Add mon_put_kn_priv() to the out_mondata error path to free
> them.
> 
> mon_get_kn_priv() and mon_put_kn_priv() moved so defined before used.

My bad, leaving this stale commit comment here. Code rearrangement
in patch 1 covered this.

> Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
> Assisted-by: GitHub_Copilot_CLI:claude-sonnet-4.6

Hmmm. I don't think I was assisted at all. Actively sabotaged with a
patch that looks plausible but is actually wrong seems more accurate.

> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  fs/resctrl/rdtgroup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index a6376a3fc4c3..0db1a92aefbe 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3067,8 +3067,10 @@ static int rdt_get_tree(struct fs_context *fc)
>  out_psl:
>  	rdt_pseudo_lock_release();
>  out_mondata:
> -	if (resctrl_arch_mon_capable())
> +	if (resctrl_arch_mon_capable()) {
> +		mon_put_kn_priv();

Claude put this call here ...

>  		kernfs_remove(kn_mondata);
> +	}
>  out_mongrp:
>  	if (resctrl_arch_mon_capable()) {

But it really ought to be here.

>  		rdtgroup_unassign_cntrs(&rdtgroup_default);
> -- 
> 2.54.0

Claude is fired. Replacement patch below.

-Tony

From 0263035539f805f5d4bddcef8968b551354cb86d Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 7 May 2026 13:48:51 -0700
Subject: [PATCH] fs/resctrl: Free mon_data structures on rdt_get_tree()
 failure

If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
fails, the mon_data structures allocated by mon_get_kn_priv() are
leaked. Add mon_put_kn_priv() to the out_mongrp error path to free
them.

Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 fs/resctrl/rdtgroup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a6376a3fc4c3..506b40dc9430 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3071,6 +3071,7 @@ static int rdt_get_tree(struct fs_context *fc)
 		kernfs_remove(kn_mondata);
 out_mongrp:
 	if (resctrl_arch_mon_capable()) {
+		mon_put_kn_priv();
 		rdtgroup_unassign_cntrs(&rdtgroup_default);
 		kernfs_remove(kn_mongrp);
 	}
-- 
2.54.0

> 


* Re: [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
  2026-05-08 21:36   ` Luck, Tony
@ 2026-05-09 12:43     ` Chen, Yu C
  2026-05-11  3:15       ` Luck, Tony
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Yu C @ 2026-05-09 12:43 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, linux-kernel, patches,
	Maciej Wieczor-Retman, Reinette Chatre, Fenghua Yu, Babu Moger,
	James Morse, Peter Newman, Drew Fustini, Dave Martin

On 5/9/2026 5:36 AM, Luck, Tony wrote:
> On Fri, May 08, 2026 at 11:21:41AM -0700, Tony Luck wrote:
> 
>  From 0263035539f805f5d4bddcef8968b551354cb86d Mon Sep 17 00:00:00 2001
> From: Tony Luck <tony.luck@intel.com>
> Date: Thu, 7 May 2026 13:48:51 -0700
> Subject: [PATCH] fs/resctrl: Free mon_data structures on rdt_get_tree()
>   failure
> 
> If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
> fails, the mon_data structures allocated by mon_get_kn_priv() are
> leaked. Add mon_put_kn_priv() to the out_mongrp error path to free
> them.
> 
> Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")

I did not find the above commit in the vanilla kernel. Should it be
2a6566038544 ("x86/resctrl: Expand the width of domid by replacing
mon_data_bits"), where mon_put_kn_priv() was introduced?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount
  2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
@ 2026-05-10 13:52   ` Chen, Yu C
  0 siblings, 0 replies; 9+ messages in thread
From: Chen, Yu C @ 2026-05-10 13:52 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, x86, linux-kernel, patches, Reinette Chatre,
	Maciej Wieczor-Retman, James Morse, Peter Newman, Babu Moger,
	Dave Martin, Fenghua Yu, Drew Fustini

Hi Tony,

On 5/9/2026 2:21 AM, Tony Luck wrote:
> From: Reinette Chatre <reinette.chatre@intel.com>
> 
> Sashiko noticed[1] a deadlock in the resctrl mount code.
> 
> rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). If
> superblock setup fails inside kernfs_get_tree(), the VFS calls kill_sb on
> the same thread before the call returns.  rdt_kill_sb() unconditionally
> attempts to acquire rdtgroup_mutex and deadlock occurs.
> 
> Move the call to kernfs_get_tree() outside of locks.
> 
> Add resctrl_unmount() helper to keep code consistent between the
> rdt_get_tree() failure path and a normal unmount.
> 
> If kernfs_get_tree() fails and ctx->kfc.new_sb_created is set, then rdt_kill_sb()
> has already been called and no further cleanup is needed.
> 

[ ... ]

> +
>   static int rdt_get_tree(struct fs_context *fc)
>   {
>   	struct rdt_fs_context *ctx = rdt_fc2context(fc);
> @@ -3043,10 +3066,6 @@ static int rdt_get_tree(struct fs_context *fc)
>   	if (ret)
>   		goto out_mondata;
>   
> -	ret = kernfs_get_tree(fc);
> -	if (ret < 0)
> -		goto out_psl;
> -
>   	if (resctrl_arch_alloc_capable())
>   		resctrl_arch_enable_alloc();
>   	if (resctrl_arch_mon_capable())
> @@ -3062,10 +3081,19 @@ static int rdt_get_tree(struct fs_context *fc)
>   						   RESCTRL_PICK_ANY_CPU);
>   	}
>   
> -	goto out;
> +	rdt_last_cmd_clear();
> +	mutex_unlock(&rdtgroup_mutex);
> +	cpus_read_unlock();
> +
> +	ret = kernfs_get_tree(fc);
> +	/*
> +	 * resctrl can only be mounted once, new superblock only expected
> +	 * to be created once.
> +	 */

I gradually came to understand the purpose of new_sb_created, and how
it indicates whether the resctrl state still needs to be released or
has already been released. Initially I thought that
ctx->kfc.new_sb_created could be false either because creating a new
superblock failed or because an existing superblock was reused. In the
latter case the code would tear down all resctrl state via
resctrl_unmount() while still returning success, so the VFS might
complete the mount with invalid resctrl state. I later realized this
scenario cannot occur, because resctrl can only be mounted once -
kernfs_get_tree() will not return 0 in that case. It might be
worthwhile adding a comment here to clarify this rationale for future
reference IMO :)

thanks,
Chenyu

> +	if (!ctx->kfc.new_sb_created)
> +		resctrl_unmount();
> +	return ret;
>   


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
  2026-05-09 12:43     ` Chen, Yu C
@ 2026-05-11  3:15       ` Luck, Tony
  0 siblings, 0 replies; 9+ messages in thread
From: Luck, Tony @ 2026-05-11  3:15 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Borislav Petkov, x86, linux-kernel, patches,
	Maciej Wieczor-Retman, Reinette Chatre, Fenghua Yu, Babu Moger,
	James Morse, Peter Newman, Drew Fustini, Dave Martin

On Sat, May 09, 2026 at 08:43:36PM +0800, Chen, Yu C wrote:
> On 5/9/2026 5:36 AM, Luck, Tony wrote:
> > On Fri, May 08, 2026 at 11:21:41AM -0700, Tony Luck wrote:
> > 
> >  From 0263035539f805f5d4bddcef8968b551354cb86d Mon Sep 17 00:00:00 2001
> > From: Tony Luck <tony.luck@intel.com>
> > Date: Thu, 7 May 2026 13:48:51 -0700
> > Subject: [PATCH] fs/resctrl: Free mon_data structures on rdt_get_tree()
> >   failure
> > 
> > If mkdir_mondata_all() succeeds but a subsequent call in rdt_get_tree()
> > fails, the mon_data structures allocated by mon_get_kn_priv() are
> > leaked. Add mon_put_kn_priv() to the out_mongrp error path to free
> > them.
> > 
> > Fixes: ee4f0ec938ad ("fs/resctrl: Simplify allocation of mon_data structures")
> 
> I did not find the above commit in the vanilla kernel. Should it be
> 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing
> mon_data_bits"), where mon_put_kn_priv() was introduced?

Thanks for the catch. Another strike against Claude. It didn't
hallucinate this commit for the Fixes: tag; that commit does appear in
my local repo, but in a dead-end branch I created to write the patch.
I posted it here in v3 of my telemetry series:

https://lore.kernel.org/all/20250407234032.241215-2-tony.luck@intel.com/

But James Morse picked it up in v11 of his "Move resctrl filesystem
code to fs/resctrl" series:

https://lore.kernel.org/all/20250513171547.15194-13-james.morse@arm.com/

where it was applied upstream in the 2a6566038544 commit you found.

I'm somewhat confused that Claude dug through my other branches instead
of just looking for ancestors of the current branch. Maybe I need an
explicit instruction to AI agents on how to find commits to use for
Fixes: tags?

Claude is now double-fired.

-Tony

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-05-11  3:16 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-08 18:21 [PATCH 0/4] fs/resctrl: Fix three long-standing issues Tony Luck
2026-05-08 18:21 ` [PATCH 1/4] fs/resctrl: Move functions to avoid forward references in subsequent fixes Tony Luck
2026-05-08 18:21 ` [PATCH 2/4] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Tony Luck
2026-05-08 21:36   ` Luck, Tony
2026-05-09 12:43     ` Chen, Yu C
2026-05-11  3:15       ` Luck, Tony
2026-05-08 18:21 ` [PATCH 3/4] fs/resctrl: Fix deadlock for errors during mount Tony Luck
2026-05-10 13:52   ` Chen, Yu C
2026-05-08 18:21 ` [PATCH 4/4] fs/resctrl: Fix issues with worker threads when CPUs are taken offline Tony Luck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox