* [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues
@ 2026-05-22 19:15 Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 1/9] fs/resctrl: Move functions to avoid forward references in subsequent fixes Reinette Chatre
` (9 more replies)
0 siblings, 10 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
v2: https://lore.kernel.org/lkml/20260515193944.15114-1-tony.luck@intel.com/
v1: https://lore.kernel.org/all/20260508182143.14592-1-tony.luck@intel.com/
While reviewing the AET series [1] Sashiko reported a deadlock during mount,
and a use-after-free when an L3 domain is removed during CPU offline. Reinette
found a memory leak in the mount error path while refactoring code for a
solution to the mount hang.
During review of V1 of this series Sashiko found a new UAF on unmount issue
that was fixed in V2.
During review of V2 Sashiko uncovered a couple more new issues: TOCTOU
involving rdtgroup_kn_put() that may lead to UAF or double-free, double
free of pseudo-locked regions, potential deadlock between resctrl unmount and
info file readers. Sashiko also found that the CPU offline fix in V2 is flawed
in its use of is_percpu_thread().
Address all issues identified. This version is significantly different from V2
because of the additional fixes and reworking of the CPU offline fix. I do not
consider this version quite "polished" but after all changes made to address
all the issues identified by Sashiko I would like to check-in with folks (and
Sashiko) on where the fixes are headed and would appreciate any feedback.
Applies against tip/master to ensure it considers pending x86/cache changes.
[1] https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com
Reinette Chatre (6):
fs/resctrl: Fix deadlock for errors during mount
fs/resctrl: Prevent use-after-free in rdtgroup_kn_put()
fs/resctrl: Fix pseudo-locking lifetime handling
fs/resctrl: Prevent deadlock and use-after-free in info file handlers
x86/resctrl: Ensure domain fully initialized before placed on RCU list
fs/resctrl: Fix UAF from worker threads when domains are removed
Tony Luck (3):
fs/resctrl: Move functions to avoid forward references in subsequent
fixes
fs/resctrl: Free mon_data structures on rdt_get_tree() failure
fs/resctrl: Fix use-after-free during unmount
arch/x86/kernel/cpu/resctrl/core.c | 18 +-
arch/x86/kernel/cpu/resctrl/intel_aet.c | 5 +-
fs/resctrl/ctrlmondata.c | 38 +-
fs/resctrl/internal.h | 15 +-
fs/resctrl/monitor.c | 100 ++-
fs/resctrl/pseudo_lock.c | 44 +-
fs/resctrl/rdtgroup.c | 847 +++++++++++++++---------
7 files changed, 680 insertions(+), 387 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 1/9] fs/resctrl: Move functions to avoid forward references in subsequent fixes
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-28 10:06 ` Ben Horgan
2026-05-22 19:15 ` [PATCH v3 2/9] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Reinette Chatre
` (8 subsequent siblings)
9 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
From: Tony Luck <tony.luck@intel.com>
rdt_get_tree() manages resctrl fs mount and rdt_kill_sb() manages resctrl
fs unmount.
There is significant overlap between error cleanup during resctrl mount
failure and cleanup on resctrl unmount yet the cleanup is not done
consistently in these two flows.
Pull some cleanup functions before rdt_get_tree() in preparation for
a new helper that can be shared between mount and unmount.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- Rewrite changelog.
---
fs/resctrl/rdtgroup.c | 376 +++++++++++++++++++++---------------------
1 file changed, 188 insertions(+), 188 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index af2cbab14497..91922fe1ea08 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2792,6 +2792,194 @@ static void schemata_list_destroy(void)
}
}
+/*
+ * Move tasks from one to the other group. If @from is NULL, then all tasks
+ * in the systems are moved unconditionally (used for teardown).
+ *
+ * If @mask is not NULL the cpus on which moved tasks are running are set
+ * in that mask so the update smp function call is restricted to affected
+ * cpus.
+ */
+static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
+ struct cpumask *mask)
+{
+ struct task_struct *p, *t;
+
+ read_lock(&tasklist_lock);
+ for_each_process_thread(p, t) {
+ if (!from || is_closid_match(t, from) ||
+ is_rmid_match(t, from)) {
+ resctrl_arch_set_closid_rmid(t, to->closid,
+ to->mon.rmid);
+
+ /*
+ * Order the closid/rmid stores above before the loads
+ * in task_curr(). This pairs with the full barrier
+ * between the rq->curr update and
+ * resctrl_arch_sched_in() during context switch.
+ */
+ smp_mb();
+
+ /*
+ * If the task is on a CPU, set the CPU in the mask.
+ * The detection is inaccurate as tasks might move or
+ * schedule before the smp function call takes place.
+ * In such a case the function call is pointless, but
+ * there is no other side effect.
+ */
+ if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
+ cpumask_set_cpu(task_cpu(t), mask);
+ }
+ }
+ read_unlock(&tasklist_lock);
+}
+
+static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
+{
+ struct rdtgroup *sentry, *stmp;
+ struct list_head *head;
+
+ head = &rdtgrp->mon.crdtgrp_list;
+ list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
+ rdtgroup_unassign_cntrs(sentry);
+ free_rmid(sentry->closid, sentry->mon.rmid);
+ list_del(&sentry->mon.crdtgrp_list);
+
+ if (atomic_read(&sentry->waitcount) != 0)
+ sentry->flags = RDT_DELETED;
+ else
+ rdtgroup_remove(sentry);
+ }
+}
+
+/*
+ * Forcibly remove all of subdirectories under root.
+ */
+static void rmdir_all_sub(void)
+{
+ struct rdtgroup *rdtgrp, *tmp;
+
+ /* Move all tasks to the default resource group */
+ rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
+
+ list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
+ /* Free any child rmids */
+ free_all_child_rdtgrp(rdtgrp);
+
+ /* Remove each rdtgroup other than root */
+ if (rdtgrp == &rdtgroup_default)
+ continue;
+
+ if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
+ rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
+ rdtgroup_pseudo_lock_remove(rdtgrp);
+
+ /*
+ * Give any CPUs back to the default group. We cannot copy
+ * cpu_online_mask because a CPU might have executed the
+ * offline callback already, but is still marked online.
+ */
+ cpumask_or(&rdtgroup_default.cpu_mask,
+ &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
+
+ rdtgroup_unassign_cntrs(rdtgrp);
+
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+
+ kernfs_remove(rdtgrp->kn);
+ list_del(&rdtgrp->rdtgroup_list);
+
+ if (atomic_read(&rdtgrp->waitcount) != 0)
+ rdtgrp->flags = RDT_DELETED;
+ else
+ rdtgroup_remove(rdtgrp);
+ }
+ /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
+ update_closid_rmid(cpu_online_mask, &rdtgroup_default);
+
+ kernfs_remove(kn_info);
+ kernfs_remove(kn_mongrp);
+ kernfs_remove(kn_mondata);
+}
+
+/**
+ * mon_get_kn_priv() - Get the mon_data priv data for this event.
+ *
+ * The same values are used across the mon_data directories of all control and
+ * monitor groups for the same event in the same domain. Keep a list of
+ * allocated structures and re-use an existing one with the same values for
+ * @rid, @domid, etc.
+ *
+ * @rid: The resource id for the event file being created.
+ * @domid: The domain id for the event file being created.
+ * @mevt: The type of event file being created.
+ * @do_sum: Whether SNC summing monitors are being created. Only set
+ * when @rid == RDT_RESOURCE_L3.
+ *
+ * Return: Pointer to mon_data private data of the event, NULL on failure.
+ */
+static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
+ struct mon_evt *mevt,
+ bool do_sum)
+{
+ struct mon_data *priv;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
+ if (priv->rid == rid && priv->domid == domid &&
+ priv->sum == do_sum && priv->evt == mevt)
+ return priv;
+ }
+
+ priv = kzalloc_obj(*priv);
+ if (!priv)
+ return NULL;
+
+ priv->rid = rid;
+ priv->domid = domid;
+ priv->sum = do_sum;
+ priv->evt = mevt;
+ list_add_tail(&priv->list, &mon_data_kn_priv_list);
+
+ return priv;
+}
+
+/**
+ * mon_put_kn_priv() - Free all allocated mon_data structures.
+ *
+ * Called when resctrl file system is unmounted.
+ */
+static void mon_put_kn_priv(void)
+{
+ struct mon_data *priv, *tmp;
+
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
+ list_del(&priv->list);
+ kfree(priv);
+ }
+}
+
+static void resctrl_fs_teardown(void)
+{
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ /* Cleared by rdtgroup_destroy_root() */
+ if (!rdtgroup_default.kn)
+ return;
+
+ rmdir_all_sub();
+ rdtgroup_unassign_cntrs(&rdtgroup_default);
+ mon_put_kn_priv();
+ rdt_pseudo_lock_release();
+ rdtgroup_default.mode = RDT_MODE_SHAREABLE;
+ closid_exit();
+ schemata_list_destroy();
+ rdtgroup_destroy_root();
+}
+
static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
@@ -2991,194 +3179,6 @@ static int rdt_init_fs_context(struct fs_context *fc)
return 0;
}
-/*
- * Move tasks from one to the other group. If @from is NULL, then all tasks
- * in the systems are moved unconditionally (used for teardown).
- *
- * If @mask is not NULL the cpus on which moved tasks are running are set
- * in that mask so the update smp function call is restricted to affected
- * cpus.
- */
-static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
- struct cpumask *mask)
-{
- struct task_struct *p, *t;
-
- read_lock(&tasklist_lock);
- for_each_process_thread(p, t) {
- if (!from || is_closid_match(t, from) ||
- is_rmid_match(t, from)) {
- resctrl_arch_set_closid_rmid(t, to->closid,
- to->mon.rmid);
-
- /*
- * Order the closid/rmid stores above before the loads
- * in task_curr(). This pairs with the full barrier
- * between the rq->curr update and
- * resctrl_arch_sched_in() during context switch.
- */
- smp_mb();
-
- /*
- * If the task is on a CPU, set the CPU in the mask.
- * The detection is inaccurate as tasks might move or
- * schedule before the smp function call takes place.
- * In such a case the function call is pointless, but
- * there is no other side effect.
- */
- if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
- cpumask_set_cpu(task_cpu(t), mask);
- }
- }
- read_unlock(&tasklist_lock);
-}
-
-static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
-{
- struct rdtgroup *sentry, *stmp;
- struct list_head *head;
-
- head = &rdtgrp->mon.crdtgrp_list;
- list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
- rdtgroup_unassign_cntrs(sentry);
- free_rmid(sentry->closid, sentry->mon.rmid);
- list_del(&sentry->mon.crdtgrp_list);
-
- if (atomic_read(&sentry->waitcount) != 0)
- sentry->flags = RDT_DELETED;
- else
- rdtgroup_remove(sentry);
- }
-}
-
-/*
- * Forcibly remove all of subdirectories under root.
- */
-static void rmdir_all_sub(void)
-{
- struct rdtgroup *rdtgrp, *tmp;
-
- /* Move all tasks to the default resource group */
- rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
-
- list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
- /* Free any child rmids */
- free_all_child_rdtgrp(rdtgrp);
-
- /* Remove each rdtgroup other than root */
- if (rdtgrp == &rdtgroup_default)
- continue;
-
- if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
- rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
- rdtgroup_pseudo_lock_remove(rdtgrp);
-
- /*
- * Give any CPUs back to the default group. We cannot copy
- * cpu_online_mask because a CPU might have executed the
- * offline callback already, but is still marked online.
- */
- cpumask_or(&rdtgroup_default.cpu_mask,
- &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
-
- rdtgroup_unassign_cntrs(rdtgrp);
-
- free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-
- kernfs_remove(rdtgrp->kn);
- list_del(&rdtgrp->rdtgroup_list);
-
- if (atomic_read(&rdtgrp->waitcount) != 0)
- rdtgrp->flags = RDT_DELETED;
- else
- rdtgroup_remove(rdtgrp);
- }
- /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
- update_closid_rmid(cpu_online_mask, &rdtgroup_default);
-
- kernfs_remove(kn_info);
- kernfs_remove(kn_mongrp);
- kernfs_remove(kn_mondata);
-}
-
-/**
- * mon_get_kn_priv() - Get the mon_data priv data for this event.
- *
- * The same values are used across the mon_data directories of all control and
- * monitor groups for the same event in the same domain. Keep a list of
- * allocated structures and re-use an existing one with the same values for
- * @rid, @domid, etc.
- *
- * @rid: The resource id for the event file being created.
- * @domid: The domain id for the event file being created.
- * @mevt: The type of event file being created.
- * @do_sum: Whether SNC summing monitors are being created. Only set
- * when @rid == RDT_RESOURCE_L3.
- *
- * Return: Pointer to mon_data private data of the event, NULL on failure.
- */
-static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
- struct mon_evt *mevt,
- bool do_sum)
-{
- struct mon_data *priv;
-
- lockdep_assert_held(&rdtgroup_mutex);
-
- list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
- if (priv->rid == rid && priv->domid == domid &&
- priv->sum == do_sum && priv->evt == mevt)
- return priv;
- }
-
- priv = kzalloc_obj(*priv);
- if (!priv)
- return NULL;
-
- priv->rid = rid;
- priv->domid = domid;
- priv->sum = do_sum;
- priv->evt = mevt;
- list_add_tail(&priv->list, &mon_data_kn_priv_list);
-
- return priv;
-}
-
-/**
- * mon_put_kn_priv() - Free all allocated mon_data structures.
- *
- * Called when resctrl file system is unmounted.
- */
-static void mon_put_kn_priv(void)
-{
- struct mon_data *priv, *tmp;
-
- lockdep_assert_held(&rdtgroup_mutex);
-
- list_for_each_entry_safe(priv, tmp, &mon_data_kn_priv_list, list) {
- list_del(&priv->list);
- kfree(priv);
- }
-}
-
-static void resctrl_fs_teardown(void)
-{
- lockdep_assert_held(&rdtgroup_mutex);
-
- /* Cleared by rdtgroup_destroy_root() */
- if (!rdtgroup_default.kn)
- return;
-
- rmdir_all_sub();
- rdtgroup_unassign_cntrs(&rdtgroup_default);
- mon_put_kn_priv();
- rdt_pseudo_lock_release();
- rdtgroup_default.mode = RDT_MODE_SHAREABLE;
- closid_exit();
- schemata_list_destroy();
- rdtgroup_destroy_root();
-}
-
static void rdt_kill_sb(struct super_block *sb)
{
struct rdt_resource *r;
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 2/9] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 1/9] fs/resctrl: Move functions to avoid forward references in subsequent fixes Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-27 15:18 ` Ben Horgan
2026-05-22 19:15 ` [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount Reinette Chatre
` (7 subsequent siblings)
9 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
From: Tony Luck <tony.luck@intel.com>
If mkdir_mondata_all() or a subsequent call in rdt_get_tree() fails, the
mon_data structures allocated by mon_get_kn_priv() are leaked.
Add mon_put_kn_priv() to the out_mongrp error path to free the mon_data
structures.
Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing mon_data_bits")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- Reword changelog.
---
fs/resctrl/rdtgroup.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 91922fe1ea08..f573db9e6e84 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3081,6 +3081,7 @@ static int rdt_get_tree(struct fs_context *fc)
kernfs_remove(kn_mondata);
out_mongrp:
if (resctrl_arch_mon_capable()) {
+ mon_put_kn_priv();
rdtgroup_unassign_cntrs(&rdtgroup_default);
kernfs_remove(kn_mongrp);
}
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 1/9] fs/resctrl: Move functions to avoid forward references in subsequent fixes Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 2/9] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-28 9:45 ` Ben Horgan
2026-05-28 13:48 ` Chen Yu
2026-05-22 19:15 ` [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount Reinette Chatre
` (6 subsequent siblings)
9 siblings, 2 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
From: Tony Luck <tony.luck@intel.com>
During unmount or failure teardown all mon_data structures that contain
monitoring event file private data are freed after which kernfs nodes are
removed. However, the RDT_DELETED flag is never set for the statically
allocated default resource group.
A concurrent reader of an event file associated with the default resource
group may, after dropping kernfs active protection, block on rdtgroup_mutex
while unmount proceeds to free the file private data and destroy the kernfs
node without waiting for the reader.
When the mutex is released, the reader wakes up, observes that RDT_DELETED
is not set for the default group, and dereferences the already-freed
file private data.
Set RDT_DELETED for the default group unconditionally since the flag does
not lead to free of this statically allocated group.
Do not allow a new resctrl mount if there are any waiters on default group
of previous mount. A new mount will re-initialize the default group that
would appear to waiters from previous mount as though the default group is
accessible causing them to access the mon_data structures from the previous
mount that have been removed.
Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing mon_data_bits")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260508182143.14592-1-tony.luck%40intel.com?part=2 [1]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- Rewrite changelog to not describe code as much.
- Rework changelog to switch to "Reported-by/Closes".
- Merge the duplicate rdtgroup_remove() comment with the function comment.
- Fix changelog to not mention that RDT_DELETED flag is set conditionally.
- Change "Fixes:" tag to point to commit that introduced dynamically
allocated mon_data this bug involves.
---
fs/resctrl/rdtgroup.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index f573db9e6e84..8a1457825919 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -585,14 +585,20 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
*
* On resource group creation via a mkdir, an extra kernfs_node reference is
* taken to ensure that the rdtgroup structure remains accessible for the
- * rdtgroup_kn_unlock() calls where it is removed.
+ * rdtgroup_kn_unlock() calls where it is removed. The default group is
+ * statically allocated: it does not have an extra reference but will have
+ * RDT_DELETED set on unmount to support safe access to its associated files
+ * via rdtgroup_kn_lock_live/rdtgroup_kn_unlock().
*
- * Drop the extra reference here, then free the rdtgroup structure.
+ * For all but the default group: drop the extra reference, then free the
+ * rdtgroup structure.
*
* Return: void
*/
static void rdtgroup_remove(struct rdtgroup *rdtgrp)
{
+ if (rdtgrp == &rdtgroup_default)
+ return;
kernfs_put(rdtgrp->kn);
kfree(rdtgrp);
}
@@ -2975,6 +2981,7 @@ static void resctrl_fs_teardown(void)
mon_put_kn_priv();
rdt_pseudo_lock_release();
rdtgroup_default.mode = RDT_MODE_SHAREABLE;
+ rdtgroup_default.flags = RDT_DELETED;
closid_exit();
schemata_list_destroy();
rdtgroup_destroy_root();
@@ -3000,6 +3007,12 @@ static int rdt_get_tree(struct fs_context *fc)
goto out;
}
+ /* Avoid races from pending operations from a previous mount */
+ if (atomic_read(&rdtgroup_default.waitcount) != 0) {
+ ret = -EBUSY;
+ goto out;
+ }
+
ret = setup_rmid_lru_list();
if (ret)
goto out;
@@ -4275,6 +4288,7 @@ static int rdtgroup_setup_root(struct rdt_fs_context *ctx)
ctx->kfc.root = rdt_root;
rdtgroup_default.kn = kernfs_root_to_node(rdt_root);
+ rdtgroup_default.flags = 0;
return 0;
}
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
` (2 preceding siblings ...)
2026-05-22 19:15 ` [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-28 10:11 ` Ben Horgan
2026-05-29 14:06 ` Chen, Yu C
2026-05-22 19:15 ` [PATCH v3 5/9] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put() Reinette Chatre
` (5 subsequent siblings)
9 siblings, 2 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). If
superblock setup fails inside kernfs_get_tree(), the VFS calls kill_sb on
the same thread before the call returns. rdt_kill_sb() unconditionally
attempts to acquire rdtgroup_mutex and deadlock occurs.
Move the call to kernfs_get_tree() outside of locks. If kernfs_get_tree()
fails and ctx->kfc.new_sb_created is set, then rdt_kill_sb() has already
been called and no further cleanup is needed.
Add an extra hold in this error path on rdtgroup_default.kn to defend against
other races destroying the root which is then dereferenced in kernfs_kill_sb()
Add resctrl_unmount() helper to keep code consistent between the
rdt_get_tree() failure path and a normal unmount.
Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- Switch to "Reported-by/Closes" in changelog
---
fs/resctrl/rdtgroup.c | 82 +++++++++++++++++++++++++++++--------------
1 file changed, 55 insertions(+), 27 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 8a1457825919..d323b515060b 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2987,10 +2987,34 @@ static void resctrl_fs_teardown(void)
rdtgroup_destroy_root();
}
+static void resctrl_unmount(void)
+{
+ struct rdt_resource *r;
+
+ cpus_read_lock();
+ mutex_lock(&rdtgroup_mutex);
+
+ rdt_disable_ctx();
+
+ /* Put everything back to default values. */
+ for_each_alloc_capable_rdt_resource(r)
+ resctrl_arch_reset_all_ctrls(r);
+
+ resctrl_fs_teardown();
+ if (resctrl_arch_alloc_capable())
+ resctrl_arch_disable_alloc();
+ if (resctrl_arch_mon_capable())
+ resctrl_arch_disable_mon();
+ resctrl_mounted = false;
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+}
+
static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
+ struct kernfs_node *rdt_root_kn;
struct rdt_l3_mon_domain *dom;
struct rdt_resource *r;
int ret;
@@ -3066,10 +3090,6 @@ static int rdt_get_tree(struct fs_context *fc)
if (ret)
goto out_mondata;
- ret = kernfs_get_tree(fc);
- if (ret < 0)
- goto out_psl;
-
if (resctrl_arch_alloc_capable())
resctrl_arch_enable_alloc();
if (resctrl_arch_mon_capable())
@@ -3085,10 +3105,37 @@ static int rdt_get_tree(struct fs_context *fc)
RESCTRL_PICK_ANY_CPU);
}
- goto out;
+ /*
+ * Ensure root kn remains accessible after mutex is unlocked so that
+ * kernfs_kill_sb() can run safely if called by kernfs_get_tree()'s
+ * failure path after creating a superblock but before taking reference
+ * on root kn.
+ */
+ kernfs_get(rdtgroup_default.kn);
+
+ /*
+ * Make backup of the current root kn being created to be used in kernfs_put().
+ * The additional reference taken above will prevent the kn from being freed
+ * before kernfs_kill_sb() can run but rdtgroup_default.kn may be set to NULL
+ * via rdtgroup_destroy_root() and its backing root (rdt_root) could be overwritten
+ * before kernfs_put() can run.
+ */
+ rdt_root_kn = rdtgroup_default.kn;
+
+ rdt_last_cmd_clear();
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+
+ ret = kernfs_get_tree(fc);
+ /*
+ * resctrl can only be mounted once, new superblock only expected
+ * to be created once.
+ */
+ if (!ctx->kfc.new_sb_created)
+ resctrl_unmount();
+ kernfs_put(rdt_root_kn);
+ return ret;
-out_psl:
- rdt_pseudo_lock_release();
out_mondata:
if (resctrl_arch_mon_capable())
kernfs_remove(kn_mondata);
@@ -3108,7 +3155,6 @@ static int rdt_get_tree(struct fs_context *fc)
out_root:
rdtgroup_destroy_root();
out:
- rdt_last_cmd_clear();
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
return ret;
@@ -3195,26 +3241,8 @@ static int rdt_init_fs_context(struct fs_context *fc)
static void rdt_kill_sb(struct super_block *sb)
{
- struct rdt_resource *r;
-
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
-
- rdt_disable_ctx();
-
- /* Put everything back to default values. */
- for_each_alloc_capable_rdt_resource(r)
- resctrl_arch_reset_all_ctrls(r);
-
- resctrl_fs_teardown();
- if (resctrl_arch_alloc_capable())
- resctrl_arch_disable_alloc();
- if (resctrl_arch_mon_capable())
- resctrl_arch_disable_mon();
- resctrl_mounted = false;
+ resctrl_unmount();
kernfs_kill_sb(sb);
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
}
static struct file_system_type rdt_fs_type = {
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 5/9] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put()
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
` (3 preceding siblings ...)
2026-05-22 19:15 ` [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-28 10:51 ` Ben Horgan
2026-05-22 19:15 ` [PATCH v3 6/9] fs/resctrl: Fix pseudo-locking lifetime handling Reinette Chatre
` (4 subsequent siblings)
9 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
A struct rdtgroup is reference counted via rdtgroup::waitcount. Callers
that need the structure to remain valid across a sleep (while waiting on
acquiring rdtgroup_mutex) take a reference with rdtgroup_kn_get() and
release it with rdtgroup_kn_put(). The release path is intended to serve
as the fallback freer: if the count drops to zero and the group has
already been marked RDT_DELETED, rdtgroup_kn_put() frees the structure.
The bulk teardown paths free_all_child_rdtgrp() and rmdir_all_sub()
resulting from a resctrl directory remove or resctrl fs unmount act as
the primary freer: they hold rdtgroup_mutex and free each rdtgroup whose
waitcount is zero, otherwise they set RDT_DELETED and leave the freeing
to the last waiter.
These two freers race. rdtgroup_kn_put() commits waitcount == 0 with
atomic_dec_and_test() outside rdtgroup_mutex, then reads rdtgroup::flags.
Between those two operations a concurrent caller of free_all_child_rdtgrp()
or rmdir_all_sub() (which holds the mutex) can observe waitcount == 0 via
atomic_read(), call rdtgroup_remove(), and kfree() the structure. The
subsequent read of rdtgroup::flags in rdtgroup_kn_put() is then a
use-after-free, and the structure may even be freed twice if the freed
memory happens to satisfy the RDT_DELETED flag check.
Replace the bare atomic_dec_and_test() with atomic_dec_and_mutex_lock()
so that the decrement-to-zero takes rdtgroup_mutex before the count
becomes globally visible. The inspection of rdtgroup::flags then runs
under the same mutex held by the bulk freers, making the two paths
mutually exclusive. The common case where the count does not reach
zero remains lock-free. Defer kernfs_unbreak_active_protection() until
after the mutex is dropped since kernfs active protections functionally
wrap rdtgroup_mutex. Remove resource group, which in turn drops its kernfs
reference, after kernfs protection is restored.
Fixes: b8511ccc75c0 ("x86/resctrl: Fix use-after-free when deleting resource groups")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=1
Assisted-by: GitHub_Copilot:gemini-3.1-pro
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- New patch
---
fs/resctrl/rdtgroup.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index d323b515060b..6418395877cf 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2606,15 +2606,24 @@ static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
{
- if (atomic_dec_and_test(&rdtgrp->waitcount) &&
- (rdtgrp->flags & RDT_DELETED)) {
+ bool needs_free;
+
+ if (!atomic_dec_and_mutex_lock(&rdtgrp->waitcount, &rdtgroup_mutex)) {
+ kernfs_unbreak_active_protection(kn);
+ return;
+ }
+
+ needs_free = rdtgrp->flags & RDT_DELETED;
+
+ mutex_unlock(&rdtgroup_mutex);
+
+ kernfs_unbreak_active_protection(kn);
+
+ if (needs_free) {
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
rdtgroup_pseudo_lock_remove(rdtgrp);
- kernfs_unbreak_active_protection(kn);
rdtgroup_remove(rdtgrp);
- } else {
- kernfs_unbreak_active_protection(kn);
}
}
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 6/9] fs/resctrl: Fix pseudo-locking lifetime handling
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
` (4 preceding siblings ...)
2026-05-22 19:15 ` [PATCH v3 5/9] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put() Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-28 10:56 ` Ben Horgan
2026-05-22 19:15 ` [PATCH v3 7/9] fs/resctrl: Prevent deadlock and use-after-free in info file handlers Reinette Chatre
` (3 subsequent siblings)
9 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
rmdir_all_sub() implements resctrl unmount teardown by iterating through
all resource groups. Memory cleanup is coordinated by deferring resource
group removal to rdtgroup_kn_put() if there are lingering waiters.
A couple of issues exist in the pseudo-locking lifetime management:
1) rmdir_all_sub() unconditionally invokes rdtgroup_pseudo_lock_remove() on
pseudo-locked groups, even if active references remain. When the deferred
rdtgroup_kn_put() is finally executed by the last waiter, it erroneously
calls rdtgroup_pseudo_lock_remove() a second time, resulting in a
NULL-pointer dereference as the pseudo-lock structures have already been
freed.
2) pseudo_lock_dev_release() drops the waitcount reference but misses
checking if the resource group has been flagged for deletion. If
resctrl is unmounted while a pseudo-lock device file is open,
rmdir_all_sub() will observe the active waitcount and defer the deletion
to the last reference holder. When the device file descriptor is
subsequently closed, pseudo_lock_dev_release() decrements the
waitcount without inspecting nor acting on the RDT_DELETED flag.
Neither the pseudo-lock memory cleanup nor the final resource group
removal are executed.
3) rdtgroup_pseudo_lock_remove() calls debugfs_remove_recursive() that
will wait for any existing debugfs users to complete before it can
do cleanup. Since the pseudo-locking debugfs handlers take
rdtgroup_mutex it is required that this debugfs_remove_recursive()
is not called with rdtgroup_mutex held, yet rmdir_all_sub() does so.
4) A pseudo-locked group's RMID is freed when it is created. On unmount
rmdir_all_sub() unconditionally frees all RMID of all groups, resulting
in a double-free of the pseudo-locked group's RMID. The additional
consequence of this is that the original free results in the
pseudo-locked group's RMID being added to the rmid_free_lru linked
list and the second free then attempts to add the same RMID entry to
the rmid_free_lru again.
Fix pseudo-locking lifetime handling by separating the pseudo-locking
infrastructure removal from the pseudo-locking region memory teardown, and
splitting pseudo-locking infrastructure removal into what needs rdtgroup_mutex
protection and what does not. With this separation an unmount of resctrl
fs can proceed to remove the pseudo-locking infrastructure of a pseudo-locked
region since the global infrastructure it depends on will be removed soon
after (resctrl_unmount()-> resctrl_fs_teardown()->rdt_pseudo_lock_release()).
Any active users of the pseudo-locked region via an open file can complete
safely afterwards and only need to do the pseudo-locked region specific
memory teardown.
The new RDT_DELETED_PLR resource group flag communicates to waiters whether
the pseudo-locked region's infrastructure has already been removed.
Fixes: 746e08590b86 ("x86/intel_rdt: Create character device exposing pseudo-locked region")
Fixes: e0bdfe8e36f3 ("x86/intel_rdt: Support creation/removal of pseudo-locked region")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=3
Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=1
Assisted-by: GitHub_Copilot:gemini-3.1-pro
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- New patch
---
fs/resctrl/internal.h | 12 +++++++++++
fs/resctrl/pseudo_lock.c | 44 +++++++++++++++++++++++++++++++++-------
fs/resctrl/rdtgroup.c | 35 ++++++++++++++++++--------------
3 files changed, 69 insertions(+), 22 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 48af75b9dc85..e7e415ee7766 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -234,6 +234,15 @@ struct rdtgroup {
/* rdtgroup.flags */
#define RDT_DELETED 1
+/*
+ * RDT_DELETED_PLR is set when the pseudo-locked group's infrastructure
+ * (its associated device, debugfs files, etc.) has been deleted via
+ * rdtgroup_pseudo_lock_remove(). This can be done while there are
+ * references to the pseudo-locked region since the pseudo-locked region
+ * self is freed separately via pseudo_lock_free() after there are no more
+ * references.
+ */
+#define RDT_DELETED_PLR 2
/* rftype.flags */
#define RFTYPE_FLAGS_CPUS_LIST 1
@@ -334,6 +343,7 @@ void rdt_last_cmd_puts(const char *s);
__printf(1, 2)
void rdt_last_cmd_printf(const char *fmt, ...);
+void rdtgroup_remove(struct rdtgroup *rdtgrp);
struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
void rdtgroup_kn_unlock(struct kernfs_node *kn);
@@ -484,6 +494,7 @@ void rdt_pseudo_lock_release(void);
int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
+void pseudo_lock_free(struct rdtgroup *rdtgrp);
#else
static inline int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp)
@@ -514,6 +525,7 @@ static inline int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp)
}
static inline void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) { }
+static inline void pseudo_lock_free(struct rdtgroup *rdtgrp) { }
#endif /* CONFIG_RESCTRL_FS_PSEUDO_LOCK */
#endif /* _FS_RESCTRL_INTERNAL_H */
diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c
index d1cb0986006e..f9d180eb699e 100644
--- a/fs/resctrl/pseudo_lock.c
+++ b/fs/resctrl/pseudo_lock.c
@@ -333,8 +333,10 @@ static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr)
*
* Return: void
*/
-static void pseudo_lock_free(struct rdtgroup *rdtgrp)
+void pseudo_lock_free(struct rdtgroup *rdtgrp)
{
+ if (!rdtgrp->plr)
+ return;
pseudo_lock_region_clear(rdtgrp->plr);
kfree(rdtgrp->plr);
rdtgrp->plr = NULL;
@@ -928,22 +930,37 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp)
{
struct pseudo_lock_region *plr = rdtgrp->plr;
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (rdtgrp->mode != RDT_MODE_PSEUDO_LOCKSETUP &&
+ rdtgrp->mode != RDT_MODE_PSEUDO_LOCKED)
+ return;
+
+ if (rdtgrp->flags & RDT_DELETED_PLR)
+ return;
+
+ rdtgrp->flags |= RDT_DELETED_PLR;
+
if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
/*
* Default group cannot be a pseudo-locked region so we can
* free closid here.
*/
closid_free(rdtgrp->closid);
- goto free;
+ return;
}
pseudo_lock_cstates_relax(plr);
- debugfs_remove_recursive(rdtgrp->plr->debugfs_dir);
+ /*
+ * Drop rdtgroup_mutex to enable debugfs_remove_recursive() to
+ * complete as it waits for active users that may be blocked
+ * waiting on rdtgroup_mutex to complete.
+ */
+ mutex_unlock(&rdtgroup_mutex);
+ debugfs_remove_recursive(plr->debugfs_dir);
+ mutex_lock(&rdtgroup_mutex);
device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor));
pseudo_lock_minor_release(plr->minor);
-
-free:
- pseudo_lock_free(rdtgrp);
}
static int pseudo_lock_dev_open(struct inode *inode, struct file *filp)
@@ -971,6 +988,7 @@ static int pseudo_lock_dev_open(struct inode *inode, struct file *filp)
static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
{
struct rdtgroup *rdtgrp;
+ bool needs_free = false;
mutex_lock(&rdtgroup_mutex);
rdtgrp = filp->private_data;
@@ -980,8 +998,20 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
return -ENODEV;
}
filp->private_data = NULL;
- atomic_dec(&rdtgrp->waitcount);
+
+ if (atomic_dec_and_test(&rdtgrp->waitcount) &&
+ (rdtgrp->flags & RDT_DELETED)) {
+ needs_free = true;
+ rdtgroup_pseudo_lock_remove(rdtgrp);
+ }
+
mutex_unlock(&rdtgroup_mutex);
+
+ if (needs_free) {
+ pseudo_lock_free(rdtgrp);
+ rdtgroup_remove(rdtgrp);
+ }
+
return 0;
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 6418395877cf..a8b4ac7dd823 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -595,7 +595,7 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
*
* Return: void
*/
-static void rdtgroup_remove(struct rdtgroup *rdtgrp)
+void rdtgroup_remove(struct rdtgroup *rdtgrp)
{
if (rdtgrp == &rdtgroup_default)
return;
@@ -2606,23 +2606,24 @@ static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
{
- bool needs_free;
+ bool needs_free = false;
if (!atomic_dec_and_mutex_lock(&rdtgrp->waitcount, &rdtgroup_mutex)) {
kernfs_unbreak_active_protection(kn);
return;
}
- needs_free = rdtgrp->flags & RDT_DELETED;
+ if (rdtgrp->flags & RDT_DELETED) {
+ needs_free = true;
+ rdtgroup_pseudo_lock_remove(rdtgrp);
+ }
mutex_unlock(&rdtgroup_mutex);
kernfs_unbreak_active_protection(kn);
if (needs_free) {
- if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
- rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
- rdtgroup_pseudo_lock_remove(rdtgrp);
+ pseudo_lock_free(rdtgrp);
rdtgroup_remove(rdtgrp);
}
}
@@ -2885,10 +2886,6 @@ static void rmdir_all_sub(void)
if (rdtgrp == &rdtgroup_default)
continue;
- if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
- rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
- rdtgroup_pseudo_lock_remove(rdtgrp);
-
/*
* Give any CPUs back to the default group. We cannot copy
* cpu_online_mask because a CPU might have executed the
@@ -2899,15 +2896,23 @@ static void rmdir_all_sub(void)
rdtgroup_unassign_cntrs(rdtgrp);
- free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-
kernfs_remove(rdtgrp->kn);
list_del(&rdtgrp->rdtgroup_list);
- if (atomic_read(&rdtgrp->waitcount) != 0)
- rdtgrp->flags = RDT_DELETED;
- else
+ if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
+ rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
+ rdtgroup_pseudo_lock_remove(rdtgrp);
+ } else {
+ /* Pseudo-locked group's RMID is freed during setup. */
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+ }
+
+ if (atomic_read(&rdtgrp->waitcount) != 0) {
+ rdtgrp->flags |= RDT_DELETED;
+ } else {
+ pseudo_lock_free(rdtgrp);
rdtgroup_remove(rdtgrp);
+ }
}
/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
update_closid_rmid(cpu_online_mask, &rdtgroup_default);
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 7/9] fs/resctrl: Prevent deadlock and use-after-free in info file handlers
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
` (5 preceding siblings ...)
2026-05-22 19:15 ` [PATCH v3 6/9] fs/resctrl: Fix pseudo-locking lifetime handling Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list Reinette Chatre
` (2 subsequent siblings)
9 siblings, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
resctrl provides files under the info/ directory to expose global
configuration and capabilities to userspace. These files are instantiated
statically during filesystem mount and expose data associated with internal
schema structures via kernfs private pointers.
A potential deadlock exists between userspace readers of these info files
and the unmount filesystem teardown process. Reading an info file invokes
kernfs which acquires an active reference, after which the handler typically
attempts to acquire the rdtgroup_mutex. Concurrently, unmounting the
filesystem holds the rdtgroup_mutex and then attempts to recursively
remove the info kernfs nodes involving kernfs_drain() which blocks until
all active references are released. Another problem exists where info files
might be accessed from an outdated mount if the filesystem is unmounted and
remounted during a reader's execution, leading to a use-after-free when
reading the now-deleted private schema data.
Introduce info_kn_lock() and info_kn_unlock() helpers to coordinate locking
across all info handlers. These helpers mirror similar logic used by resource
group handlers by deliberately breaking the kernfs active protection before
attempting to acquire the rdtgroup_mutex, preventing the deadlock. To guard
against the vulnerability from rapid mount cycling, info_kn_lock() securely
walks the parent lineage of the kernfs node under an RCU section to confirm
the node belongs to the globally active root before permitting the operation
to proceed. Convert all info file handlers to use this helper and only
de-reference the schema after it determined safe to do so.
Make no attempt to output error message to last_cmd_status on failure
since failure implies there is no filesystem with which to display error
to user space.
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=3
Assisted-by: GitHub_Copilot:gemini-3.1-pro
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- New patch
---
fs/resctrl/ctrlmondata.c | 38 ++++----
fs/resctrl/internal.h | 3 +-
fs/resctrl/monitor.c | 48 +++++-----
fs/resctrl/rdtgroup.c | 192 ++++++++++++++++++++++++++++++++-------
4 files changed, 203 insertions(+), 78 deletions(-)
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 9a7dfc48cb2e..b95bf6208be2 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -769,10 +769,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
int resctrl_io_alloc_show(struct kernfs_open_file *of, struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
if (r->cache.io_alloc_capable) {
if (resctrl_arch_get_io_alloc_enabled(r))
seq_puts(seq, "enabled\n");
@@ -782,7 +784,7 @@ int resctrl_io_alloc_show(struct kernfs_open_file *of, struct seq_file *seq, voi
seq_puts(seq, "not supported\n");
}
- mutex_unlock(&rdtgroup_mutex);
+ info_kn_unlock(of->kn);
return 0;
}
@@ -847,7 +849,7 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
char const *grp_name;
u32 io_alloc_closid;
bool enable;
@@ -857,9 +859,10 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
if (ret)
return ret;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
rdt_last_cmd_clear();
if (!r->cache.io_alloc_capable) {
@@ -907,8 +910,7 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
}
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret ?: nbytes;
}
@@ -916,14 +918,15 @@ ssize_t resctrl_io_alloc_write(struct kernfs_open_file *of, char *buf,
int resctrl_io_alloc_cbm_show(struct kernfs_open_file *of, struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
int ret = 0;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
+ r = s->res;
if (!r->cache.io_alloc_capable) {
rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name);
ret = -ENODEV;
@@ -945,8 +948,7 @@ int resctrl_io_alloc_cbm_show(struct kernfs_open_file *of, struct seq_file *seq,
show_doms(seq, s, NULL, resctrl_io_alloc_closid(r));
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret;
}
@@ -1013,7 +1015,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
u32 io_alloc_closid;
int ret = 0;
@@ -1023,10 +1025,11 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open_file *of, char *buf,
buf[nbytes - 1] = '\0';
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
+ r = s->res;
if (!r->cache.io_alloc_capable) {
rdt_last_cmd_printf("io_alloc is not supported on %s\n", s->name);
ret = -ENODEV;
@@ -1051,8 +1054,7 @@ ssize_t resctrl_io_alloc_cbm_write(struct kernfs_open_file *of, char *buf,
out_clear_configs:
rdt_staged_configs_clear();
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret ?: nbytes;
}
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index e7e415ee7766..4e3173f25e92 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -345,8 +345,9 @@ void rdt_last_cmd_printf(const char *fmt, ...);
void rdtgroup_remove(struct rdtgroup *rdtgrp);
struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
-
void rdtgroup_kn_unlock(struct kernfs_node *kn);
+bool info_kn_lock(struct kernfs_node *kn);
+void info_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 0e6a389a16bf..4565b9864a9e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -1052,7 +1052,8 @@ int event_filter_show(struct kernfs_open_file *of, struct seq_file *seq, void *v
bool sep = false;
int ret = 0, i;
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
r = resctrl_arch_get_resource(mevt->rid);
@@ -1073,7 +1074,7 @@ int event_filter_show(struct kernfs_open_file *of, struct seq_file *seq, void *v
seq_putc(seq, '\n');
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
+ info_kn_unlock(of->kn);
return ret;
}
@@ -1084,7 +1085,8 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_open_file *of, struct seq_fil
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
int ret = 0;
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
@@ -1096,7 +1098,7 @@ int resctrl_mbm_assign_on_mkdir_show(struct kernfs_open_file *of, struct seq_fil
seq_printf(s, "%u\n", r->mon.mbm_assign_on_mkdir);
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
+ info_kn_unlock(of->kn);
return ret;
}
@@ -1112,7 +1114,8 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kernfs_open_file *of, char *buf
if (ret)
return ret;
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
@@ -1124,7 +1127,7 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kernfs_open_file *of, char *buf
r->mon.mbm_assign_on_mkdir = value;
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
+ info_kn_unlock(of->kn);
return ret ?: nbytes;
}
@@ -1414,8 +1417,8 @@ ssize_t event_filter_write(struct kernfs_open_file *of, char *buf, size_t nbytes
buf[nbytes - 1] = '\0';
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
@@ -1438,8 +1441,7 @@ ssize_t event_filter_write(struct kernfs_open_file *of, char *buf, size_t nbytes
}
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret ?: nbytes;
}
@@ -1450,7 +1452,8 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
bool enabled;
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
enabled = resctrl_arch_mbm_cntr_assign_enabled(r);
if (r->mon.mbm_cntr_assignable) {
@@ -1469,7 +1472,7 @@ int resctrl_mbm_assign_mode_show(struct kernfs_open_file *of,
seq_puts(s, "[default]\n");
}
- mutex_unlock(&rdtgroup_mutex);
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1488,8 +1491,8 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of, char *buf,
buf[nbytes - 1] = '\0';
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
@@ -1547,8 +1550,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of, char *buf,
}
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret ?: nbytes;
}
@@ -1560,8 +1562,8 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
struct rdt_l3_mon_domain *dom;
bool sep = false;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
list_for_each_entry(dom, &r->mon_domains, hdr.list) {
if (sep)
@@ -1572,8 +1574,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
}
seq_putc(s, '\n');
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1586,8 +1587,8 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
u32 cntrs, i;
int ret = 0;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
@@ -1613,8 +1614,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
seq_putc(s, '\n');
out_unlock:
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret;
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a8b4ac7dd823..6601b138ac7a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -977,13 +977,14 @@ static int rdt_last_cmd_status_show(struct kernfs_open_file *of,
{
int len;
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
len = seq_buf_used(&last_cmd_status);
if (len)
seq_printf(seq, "%.*s", len, last_cmd_status_buf);
else
seq_puts(seq, "ok\n");
- mutex_unlock(&rdtgroup_mutex);
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1002,7 +1003,11 @@ static int rdt_num_closids_show(struct kernfs_open_file *of,
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
seq_printf(seq, "%u\n", s->num_closid);
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1010,9 +1015,14 @@ static int rdt_default_ctrl_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
seq_printf(seq, "%x\n", resctrl_get_default_ctrl(r));
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1020,9 +1030,15 @@ static int rdt_min_cbm_bits_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
seq_printf(seq, "%u\n", r->cache.min_cbm_bits);
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1030,9 +1046,14 @@ static int rdt_shareable_bits_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
seq_printf(seq, "%x\n", r->cache.shareable_bits);
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1060,15 +1081,16 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
*/
unsigned long sw_shareable = 0, hw_shareable = 0;
unsigned long exclusive = 0, pseudo_locked = 0;
- struct rdt_resource *r = s->res;
struct rdt_ctrl_domain *dom;
int i, hwb, swb, excl, psl;
+ struct rdt_resource *r;
enum rdtgrp_mode mode;
bool sep = false;
u32 ctrl_val;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
if (sep)
seq_putc(seq, ';');
@@ -1144,8 +1166,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
sep = true;
}
seq_putc(seq, '\n');
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1153,9 +1174,14 @@ static int rdt_min_bw_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
seq_printf(seq, "%u\n", r->membw.min_bw);
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1164,8 +1190,12 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
seq_printf(seq, "%u\n", r->mon.num_rmid);
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1175,6 +1205,8 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
struct mon_evt *mevt;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
for_each_mon_event(mevt) {
if (mevt->rid != r->rid || !mevt->enabled)
continue;
@@ -1184,6 +1216,8 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
seq_printf(seq, "%s_config\n", mevt->name);
}
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1191,9 +1225,14 @@ static int rdt_bw_gran_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
seq_printf(seq, "%u\n", r->membw.bw_gran);
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1201,16 +1240,24 @@ static int rdt_delay_linear_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
seq_printf(seq, "%u\n", r->membw.delay_linear);
+ info_kn_unlock(of->kn);
+
return 0;
}
static int max_threshold_occ_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold);
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1219,22 +1266,28 @@ static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
switch (r->membw.throttle_mode) {
case THREAD_THROTTLE_PER_THREAD:
seq_puts(seq, "per-thread\n");
- return 0;
+ break;
case THREAD_THROTTLE_MAX:
seq_puts(seq, "max\n");
- return 0;
+ break;
case THREAD_THROTTLE_UNDEFINED:
seq_puts(seq, "undefined\n");
- return 0;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
}
- WARN_ON_ONCE(1);
-
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1248,12 +1301,20 @@ static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
if (ret)
return ret;
- if (bytes > resctrl_rmid_realloc_limit)
- return -EINVAL;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+
+ if (bytes > resctrl_rmid_realloc_limit) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes);
- return nbytes;
+out_unlock:
+ info_kn_unlock(of->kn);
+
+ return ret ?: nbytes;
}
/*
@@ -1293,10 +1354,15 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
struct resctrl_schema *s = rdt_kn_parent_priv(of->kn);
- struct rdt_resource *r = s->res;
+ struct rdt_resource *r;
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+ r = s->res;
seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks);
+ info_kn_unlock(of->kn);
+
return 0;
}
@@ -1652,8 +1718,8 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
struct rdt_l3_mon_domain *dom;
bool sep = false;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ lockdep_assert_cpus_held();
+ lockdep_assert_held(&rdtgroup_mutex);
list_for_each_entry(dom, &r->mon_domains, hdr.list) {
if (sep)
@@ -1670,8 +1736,6 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
}
seq_puts(s, "\n");
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
return 0;
}
@@ -1681,8 +1745,12 @@ static int mbm_total_bytes_config_show(struct kernfs_open_file *of,
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+
mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID);
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1691,8 +1759,12 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
+
mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID);
+ info_kn_unlock(of->kn);
return 0;
}
@@ -1790,8 +1862,8 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
if (nbytes == 0 || buf[nbytes - 1] != '\n')
return -EINVAL;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
@@ -1799,8 +1871,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID);
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret ?: nbytes;
}
@@ -1816,8 +1887,8 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
if (nbytes == 0 || buf[nbytes - 1] != '\n')
return -EINVAL;
- cpus_read_lock();
- mutex_lock(&rdtgroup_mutex);
+ if (!info_kn_lock(of->kn))
+ return -ENOENT;
rdt_last_cmd_clear();
@@ -1825,8 +1896,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of,
ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID);
- mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
+ info_kn_unlock(of->kn);
return ret ?: nbytes;
}
@@ -2660,6 +2730,58 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
rdtgroup_kn_put(rdtgrp, kn);
}
+/*
+ * Accessing the kn after breaking active protection is safe since the open
+ * of resctrl file holds a kernfs base reference (different from active
+ * protection) on the kn ensuring that it remains accessible even if it was
+ * unlinked. Each kn in turn holds base reference to parent so the kn's
+ * genealogy remains in memory until all base references dropped.
+ */
+static bool is_active_resctrl_node(struct kernfs_node *kn)
+{
+ struct kernfs_node *p;
+ bool match = false;
+
+ guard(rcu)();
+ p = kn;
+ while (p) {
+ if (p == rdtgroup_default.kn) {
+ match = true;
+ break;
+ }
+ p = rcu_dereference(p->__parent);
+ }
+
+ return match;
+}
+
+bool info_kn_lock(struct kernfs_node *kn)
+{
+ kernfs_break_active_protection(kn);
+ cpus_read_lock();
+ mutex_lock(&rdtgroup_mutex);
+
+ /*
+ * Check both if resctrl is torn down (!rdtgroup_default.kn) and
+ * if the reader's kernfs_node originates from a dead mount.
+ */
+ if (!rdtgroup_default.kn || !is_active_resctrl_node(kn)) {
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+ kernfs_unbreak_active_protection(kn);
+ return false;
+ }
+
+ return true;
+}
+
+void info_kn_unlock(struct kernfs_node *kn)
+{
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+ kernfs_unbreak_active_protection(kn);
+}
+
static int mkdir_mondata_all(struct kernfs_node *parent_kn,
struct rdtgroup *prgrp,
struct kernfs_node **mon_data_kn);
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
` (6 preceding siblings ...)
2026-05-22 19:15 ` [PATCH v3 7/9] fs/resctrl: Prevent deadlock and use-after-free in info file handlers Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-28 16:11 ` Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed Reinette Chatre
2026-05-28 20:08 ` [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Luck, Tony
9 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
A resctrl domain consists of the domain structure self that includes
pointers to dynamically allocated filesystem as well as architecture
specific data. For example, the L3 monitoring domain structure consists
of the architecture specific struct rdt_hw_l3_mon_domain that contains
the dynamically allocated rdt_hw_l3_mon_domain::arch_mbm_states
architectural state and the embedded struct rdt_l3_mon_domain contains
the dynamically allocated rdt_l3_mon_domain::mbm_states resctrl fs state.
The domains are placed on an RCU protected list so that readers could
access domains via cpus_read_lock() or from an RCU read-side critical
section.
A reader accessing a domain via the RCU list expects that the domain and
all its dynamically allocated data is accessible. Only place domain on RCU
list when all its dynamically allocated data is ready, similarly unlink
from RCU list before removing any of its dynamically allocated data.
There are no readers accessing a domain via RCU list. Ensure safety of
access when such reader arrives.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V2:
- New patch
---
arch/x86/kernel/cpu/resctrl/core.c | 18 +++++++-----------
arch/x86/kernel/cpu/resctrl/intel_aet.c | 5 ++---
2 files changed, 9 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9c01d2562b7a..bca782050198 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -515,14 +515,12 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
- list_add_tail_rcu(&d->hdr.list, add_pos);
-
err = resctrl_online_ctrl_domain(r, d);
if (err) {
- list_del_rcu(&d->hdr.list);
- synchronize_rcu();
ctrl_domain_free(hw_dom);
+ return;
}
+ list_add_tail_rcu(&d->hdr.list, add_pos);
}
static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
@@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
return;
}
- list_add_tail_rcu(&d->hdr.list, add_pos);
-
err = resctrl_online_mon_domain(r, &d->hdr);
if (err) {
- list_del_rcu(&d->hdr.list);
- synchronize_rcu();
l3_mon_domain_free(hw_dom);
+ return;
}
+ list_add_tail_rcu(&d->hdr.list, add_pos);
}
static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
@@ -642,9 +638,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
hw_dom = resctrl_to_arch_ctrl_dom(d);
- resctrl_offline_ctrl_domain(r, d);
list_del_rcu(&hdr->list);
synchronize_rcu();
+ resctrl_offline_ctrl_domain(r, d);
/*
* rdt_ctrl_domain "d" is going to be freed below, so clear
@@ -689,9 +685,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
- resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
synchronize_rcu();
+ resctrl_offline_mon_domain(r, hdr);
l3_mon_domain_free(hw_dom);
break;
}
@@ -702,9 +698,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
pkgd = container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr);
- resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
synchronize_rcu();
+ resctrl_offline_mon_domain(r, hdr);
kfree(pkgd);
break;
}
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 89b8b619d5d5..c22c3cf5167d 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -398,12 +398,11 @@ void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
d->hdr.type = RESCTRL_MON_DOMAIN;
d->hdr.rid = RDT_RESOURCE_PERF_PKG;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- list_add_tail_rcu(&d->hdr.list, add_pos);
err = resctrl_online_mon_domain(r, &d->hdr);
if (err) {
- list_del_rcu(&d->hdr.list);
- synchronize_rcu();
kfree(d);
+ return;
}
+ list_add_tail_rcu(&d->hdr.list, add_pos);
}
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
` (7 preceding siblings ...)
2026-05-22 19:15 ` [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list Reinette Chatre
@ 2026-05-22 19:15 ` Reinette Chatre
2026-05-26 15:32 ` Luck, Tony
2026-05-28 16:12 ` Reinette Chatre
2026-05-28 20:08 ` [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Luck, Tony
9 siblings, 2 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-22 19:15 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches, reinette.chatre
The mbm_handle_overflow() and cqm_handle_limbo() workers read event
counters and may sleep while doing so. They are scheduled via
delayed_work embedded in struct rdt_l3_mon_domain. Architecture allocates
and frees these domains from CPU hotplug callbacks under cpus_write_lock(),
and the workers acquire cpus_read_lock() to keep the domain alive across
their access.
A use-after-free can occur when a worker is blocked waiting for
cpus_read_lock() while the hotplug core holds cpus_write_lock():
the architecture frees the rdt_l3_mon_domain that contains the worker's
work_struct. When the worker unblocks, the container_of() it performs on
the embedded work pointer dereferences freed memory.
Drop cpus_read_lock() from the workers and instead drain pending and
in-flight work synchronously before the architecture can free the domain.
Since architecture offlines the domain under cpus_write_lock() after it has
been unlinked from the RCU list and a grace period has elapsed no new work
can be scheduled. The cancel only needs to wait out existing work.
Drop rdtgroup_mutex during CPU offline around cancel_delayed_work_sync()
so that a worker waiting on the mutex can complete before re-pinning the
work on a different CPU.
Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com # [1]
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since v2:
- Rewrite changelog
- v2 attempted to solve the issue by using is_percpu_thread() within the
worker to learn if CPU worker was running on is going offline. A
Sashiko (https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=5)
pointed out that this would not be able to handle the scenario if one
of the hotplug handlers following the resctrl offline handlers failed.
- Some other fixes attempted that failed:
- Switch to accessing domain structure in handler via RCU so that CPU
hotplug lock no longer needed. Use cancel_delayed_work_sync() with
mutex dropped to cancel worker. Running worker from RCU read-side
critical section is a problem since the worker needs to be
able to sleep (mbm_handle_overflow()->mbm_update()->
mbm_update_one_event()->resctrl_arch_mon_ctx_alloc()->
might_sleep())
- Adding a reference count to the domain structure to avoid the worker
needing to take CPU hotplug lock. This ended up being very complicated
with the architecture needing new APIs to manage the reference count
which cannot cleanly integrate into MPAM since it uses a single
architecture domain structure to contain both the control and monitoring
domain structures. Managing the references across mount, unmount,
online, offline, as well as worker self exit resulted in several
asymmetrical and complicated paths that were error prone. Locking also
proved to be complicated since architecture would need to initiate
domain free that will need to call back into resctrl that will take
rdtgroup_mutex which means that references need to be taken/released
without locking.
---
fs/resctrl/monitor.c | 52 ++++++++++++++++++++++++++++++++++---------
fs/resctrl/rdtgroup.c | 52 ++++++++++++++++++++++++++++++++++++++-----
2 files changed, 89 insertions(+), 15 deletions(-)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 4565b9864a9e..37df65229109 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -623,14 +623,22 @@ void mon_event_count(void *info)
rr->err = 0;
}
-static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu,
- struct rdt_resource *r)
+/*
+ * Find the software controller's ctrl domain that contains @cpu on resource @r.
+ *
+ * Only called from the mbm_over worker via update_mba_bw() where the returned
+ * domain is kept alive by cancel_delayed_work_sync() in
+ * resctrl_offline_ctrl_domain(). This drains this worker and then waits on
+ * rdtgroup_mutex held here before the architecture can free the ctrl domain.
+ *
+ * Context: Call from RCU read-side critical section.
+ */
+static struct rdt_ctrl_domain *get_sc_ctrl_domain_from_cpu(int cpu,
+ struct rdt_resource *r)
{
struct rdt_ctrl_domain *d;
- lockdep_assert_cpus_held();
-
- list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+ list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return d;
@@ -691,7 +699,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_m
if (WARN_ON_ONCE(!pmbm_data))
return;
- dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
+ guard(rcu)();
+ dom_mba = get_sc_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
pr_warn_once("Failure to get domain for MBA update\n");
return;
@@ -794,9 +803,19 @@ void cqm_handle_limbo(struct work_struct *work)
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
struct rdt_l3_mon_domain *d;
- cpus_read_lock();
+ /*
+ * Safe to run without CPU hotplug lock. Work is guaranteed to be
+ * canceled before the domain structure is removed.
+ */
mutex_lock(&rdtgroup_mutex);
+ /*
+ * Ensure the worker is dedicated to a CPU as intended and not
+ * relocated by workqueue subsystem as part of CPU going offline.
+ */
+ if (!is_percpu_thread())
+ goto out_unlock;
+
d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -808,8 +827,8 @@ void cqm_handle_limbo(struct work_struct *work)
delay);
}
+out_unlock:
mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
}
/**
@@ -841,7 +860,10 @@ void mbm_handle_overflow(struct work_struct *work)
struct list_head *head;
struct rdt_resource *r;
- cpus_read_lock();
+ /*
+ * Safe to run without CPU hotplug lock. Work is guaranteed to be
+ * canceled before the domain structure is removed.
+ */
mutex_lock(&rdtgroup_mutex);
/*
@@ -851,6 +873,17 @@ void mbm_handle_overflow(struct work_struct *work)
if (!resctrl_mounted || !resctrl_arch_mon_capable())
goto out_unlock;
+ /*
+ * Ensure the worker is dedicated to a CPU and not relocated by
+ * workqueue subsystem as part of CPU going offline since reading
+ * events depend on smp_processor_id(). After passing this check
+ * smp_processor_id() is valid for entire duration of this worker
+ * since it runs with rdtgroup_mutex held and the offline handler needs
+ * rdtgroup_mutex to offline the CPU being run on here.
+ */
+ if (!is_percpu_thread())
+ goto out_unlock;
+
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
@@ -875,7 +908,6 @@ void mbm_handle_overflow(struct work_struct *work)
out_unlock:
mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
}
/**
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 6601b138ac7a..9281c5a71063 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4493,6 +4493,29 @@ static void domain_destroy_l3_mon_state(struct rdt_l3_mon_domain *d)
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
+ /*
+ * mbm_handle_overflow() may dereference this ctrl domain via
+ * update_mba_bw()->get_sc_ctrl_domain_from_cpu(). The architecture has
+ * unlinked the domain from the RCU list and waited a grace period, so
+ * no new worker iteration can find it; drain any worker that already
+ * holds a pointer to it before the architecture frees the domain.
+ *
+ * Software controller is enabled/disabled on mount/unmount with
+ * cpus_read_lock() held. Running here with cpus_write_lock() so
+ * there are no concurrent changes to software controller status.
+ */
+ if (r->rid == RDT_RESOURCE_MBA && is_mba_sc(r)) {
+ struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+ struct rdt_l3_mon_domain *mon_d;
+
+ list_for_each_entry(mon_d, &l3->mon_domains, hdr.list) {
+ if (mon_d->hdr.id == d->hdr.id) {
+ cancel_delayed_work_sync(&mon_d->mbm_over);
+ break;
+ }
+ }
+ }
+
mutex_lock(&rdtgroup_mutex);
if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
@@ -4505,6 +4528,24 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
{
struct rdt_l3_mon_domain *d;
+ /*
+ * Called by architecture under CPU hotplug lock as it prepares to remove
+ * the domain which is guaranteed to be accessible here.
+ * The domain has been unlinked from the RCU list and a grace period
+ * has elapsed, so no new worker can be scheduled. Drain any worker that
+ * is in flight or pending before letting architecture proceed to free
+ * the domain that has the workers' struct delayed_work embedded.
+ * Do so before taking rdtgroup_mutex since the workers also acquire it.
+ */
+ if (r->rid == RDT_RESOURCE_L3 &&
+ domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) {
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
+ if (resctrl_is_mbm_enabled())
+ cancel_delayed_work_sync(&d->mbm_over);
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
+ cancel_delayed_work_sync(&d->cqm_limbo);
+ }
+
mutex_lock(&rdtgroup_mutex);
/*
@@ -4521,8 +4562,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
goto out_unlock;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- if (resctrl_is_mbm_enabled())
- cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
/*
* When a package is going down, forcefully
@@ -4533,7 +4572,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* package never comes back.
*/
__check_limbo(d, true);
- cancel_delayed_work(&d->cqm_limbo);
}
domain_destroy_l3_mon_state(d);
@@ -4714,12 +4752,16 @@ void resctrl_offline_cpu(unsigned int cpu)
d = get_mon_domain_from_cpu(cpu, l3);
if (d) {
if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
- cancel_delayed_work(&d->mbm_over);
+ mutex_unlock(&rdtgroup_mutex);
+ cancel_delayed_work_sync(&d->mbm_over);
+ mutex_lock(&rdtgroup_mutex);
mbm_setup_overflow_handler(d, 0, cpu);
}
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
- cancel_delayed_work(&d->cqm_limbo);
+ mutex_unlock(&rdtgroup_mutex);
+ cancel_delayed_work_sync(&d->cqm_limbo);
+ mutex_lock(&rdtgroup_mutex);
cqm_setup_limbo_handler(d, 0, cpu);
}
}
--
2.50.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-22 19:15 ` [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed Reinette Chatre
@ 2026-05-26 15:32 ` Luck, Tony
2026-05-26 17:53 ` Reinette Chatre
2026-05-28 16:12 ` Reinette Chatre
1 sibling, 1 reply; 40+ messages in thread
From: Luck, Tony @ 2026-05-26 15:32 UTC (permalink / raw)
To: Reinette Chatre
Cc: james.morse, Dave.Martin, babu.moger, bp, tglx, dave.hansen, x86,
hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
On Fri, May 22, 2026 at 12:15:13PM -0700, Reinette Chatre wrote:
> - Adding a reference count to the domain structure to avoid the worker
> needing to take CPU hotplug lock. This ended up being very complicated
> with the architecture needing new APIs to manage the reference count
> which cannot cleanly integrate into MPAM since it uses a single
> architecture domain structure to contain both the control and monitoring
> domain structures. Managing the references across mount, unmount,
> online, offline, as well as worker self exit resulted in several
> asymmetrical and complicated paths that were error prone. Locking also
> proved to be complicated since architecture would need to initiate
> domain free that will need to call back into resctrl that will take
> rdtgroup_mutex which means that references need to be taken/released
> without locking.
I'd been working on a reference count approach too. The MPAM combined
domain for control and monitoring doesn't seem insurmountable. Mostly
because it seems unlikely that the problem with worker threads would
ever apply to control domains. Maybe I missed something, but just adding
an architecture *release() function that can be used by file system code
to drop reference counts on the domain when worker threads exit seems
enough.
My patch below.
-Tony
From 611fd8ad816abd37ef9a65b39175ce05907a1d41 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 21 May 2026 15:14:27 -0700
Subject: [PATCH] fs,mpam,x86/resctrl: Track reference count for L3 monitor
domains
There are race conditions[1] when the last CPU of a domain is taken offline
and a worker thread may access the domain structure after it is freed.
Add a rdt_l3_mon_domain::kref to track users of the domain. Don't try
to cancel worker threads when CPUs are taken offline. Just set the
target CPU for the thread to nr_cpu_ids to indicate the worker needs
to take action next time it runs.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com [1]
---
include/linux/resctrl.h | 4 +++
arch/x86/kernel/cpu/resctrl/core.c | 16 +++++++++--
drivers/resctrl/mpam_resctrl.c | 15 +++++++++-
fs/resctrl/monitor.c | 46 +++++++++++++++++++++++-------
fs/resctrl/rdtgroup.c | 16 +++--------
5 files changed, 71 insertions(+), 26 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 006e57fd7ca5..715cf62a5864 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -4,6 +4,7 @@
#include <linux/cacheinfo.h>
#include <linux/kernel.h>
+#include <linux/kref.h>
#include <linux/list.h>
#include <linux/pid.h>
#include <linux/resctrl_types.h>
@@ -181,6 +182,7 @@ struct mbm_cntr_cfg {
/**
* struct rdt_l3_mon_domain - group of CPUs sharing RDT_RESOURCE_L3 monitoring
* @hdr: common header for different domain types
+ * @kref: count of active users
* @ci_id: cache info id for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
* @mbm_states: Per-event pointer to the MBM event's saved state.
@@ -195,6 +197,7 @@ struct mbm_cntr_cfg {
*/
struct rdt_l3_mon_domain {
struct rdt_domain_hdr hdr;
+ struct kref kref;
unsigned int ci_id;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_states[QOS_NUM_L3_MBM_EVENTS];
@@ -515,6 +518,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
+void resctrl_arch_l3_mon_domain_release(struct kref *kref);
/*
* Architecture hook called at beginning of first file system mount attempt.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7667cf7c4e94..438016f3097c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -396,6 +396,17 @@ static void l3_mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
kfree(hw_dom);
}
+void resctrl_arch_l3_mon_domain_release(struct kref *kref)
+{
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
+
+ d = container_of(kref, struct rdt_l3_mon_domain, kref);
+ hw_dom = resctrl_to_arch_mon_dom(d);
+
+ l3_mon_domain_free(hw_dom);
+}
+
static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
{
struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
@@ -539,6 +550,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
d->hdr.rid = RDT_RESOURCE_L3;
+ kref_init(&d->kref);
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -680,18 +692,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
switch (r->rid) {
case RDT_RESOURCE_L3: {
- struct rdt_hw_l3_mon_domain *hw_dom;
struct rdt_l3_mon_domain *d;
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
synchronize_rcu();
- l3_mon_domain_free(hw_dom);
+ kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
break;
}
case RDT_RESOURCE_PERF_PKG: {
diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c
index 226ff6f532fa..15f688f18fb6 100644
--- a/drivers/resctrl/mpam_resctrl.c
+++ b/drivers/resctrl/mpam_resctrl.c
@@ -1391,6 +1391,8 @@ mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res)
if (!dom)
return ERR_PTR(-ENOMEM);
+ kref_init(&dom->resctrl_mon_dom.kref);
+
if (r->alloc_capable) {
dom->ctrl_comp = ctrl_comp;
@@ -1548,6 +1550,17 @@ int mpam_resctrl_online_cpu(unsigned int cpu)
return 0;
}
+void resctrl_arch_l3_mon_domain_release(struct kref *kref)
+{
+ struct mpam_resctrl_dom *dom;
+ struct rdt_l3_mon_domain *d;
+
+ d = container_of(kref, struct rdt_l3_mon_domain, kref);
+ dom = container_of(d, struct mpam_resctrl_dom, resctrl_mon_dom);
+
+ kfree(dom);
+}
+
void mpam_resctrl_offline_cpu(unsigned int cpu)
{
struct mpam_resctrl_res *res;
@@ -1589,7 +1602,7 @@ void mpam_resctrl_offline_cpu(unsigned int cpu)
}
if (ctrl_dom_empty && mon_dom_empty)
- kfree(dom);
+ kref_put(&dom->resctrl_mon_dom.kref, resctrl_arch_l3_mon_domain_release);
}
}
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9fd901c78dc6..b4af302bbad1 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -799,15 +799,26 @@ void cqm_handle_limbo(struct work_struct *work)
d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
+ if (d->cqm_work_cpu == nr_cpu_ids)
+ goto reschedule;
+
__check_limbo(d, false);
if (has_busy_rmid(d)) {
+reschedule:
d->cqm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
RESCTRL_PICK_ANY_CPU);
+ if (d->cqm_work_cpu == nr_cpu_ids)
+ goto out_release;
schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo,
delay);
+ goto out_unlock;
}
+out_release:
+ kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
+
+out_unlock:
mutex_unlock(&rdtgroup_mutex);
cpus_read_unlock();
}
@@ -829,8 +840,11 @@ void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_
cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
dom->cqm_work_cpu = cpu;
- if (cpu < nr_cpu_ids)
- schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
+ if (cpu < nr_cpu_ids) {
+ kref_get(&dom->kref);
+ if (!schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay))
+ kref_put(&dom->kref, resctrl_arch_l3_mon_domain_release);
+ }
}
void mbm_handle_overflow(struct work_struct *work)
@@ -844,15 +858,17 @@ void mbm_handle_overflow(struct work_struct *work)
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
+ r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+ d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
+
/*
- * If the filesystem has been unmounted this work no longer needs to
- * run.
+ * If the filesystem has been unmounted this work no longer needs to run.
*/
if (!resctrl_mounted || !resctrl_arch_mon_capable())
- goto out_unlock;
+ goto out_release;
- r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
+ if (d->mbm_work_cpu == nr_cpu_ids)
+ goto reschedule;
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp);
@@ -869,9 +885,16 @@ void mbm_handle_overflow(struct work_struct *work)
* Re-check for housekeeping CPUs. This allows the overflow handler to
* move off a nohz_full CPU quickly.
*/
+reschedule:
d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
RESCTRL_PICK_ANY_CPU);
- schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
+ if (d->mbm_work_cpu != nr_cpu_ids) {
+ schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
+ goto out_unlock;
+ }
+
+out_release:
+ kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
@@ -901,8 +924,11 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
dom->mbm_work_cpu = cpu;
- if (cpu < nr_cpu_ids)
- schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
+ if (cpu < nr_cpu_ids) {
+ kref_get(&dom->kref);
+ if (!schedule_delayed_work_on(cpu, &dom->mbm_over, delay))
+ kref_put(&dom->kref, resctrl_arch_l3_mon_domain_release);
+ }
}
int setup_rmid_lru_list(void)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 5dfdaa6f9d8f..a4830a2364da 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4332,8 +4332,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
goto out_unlock;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- if (resctrl_is_mbm_enabled())
- cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
/*
* When a package is going down, forcefully
@@ -4344,7 +4342,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* package never comes back.
*/
__check_limbo(d, true);
- cancel_delayed_work(&d->cqm_limbo);
}
domain_destroy_l3_mon_state(d);
@@ -4524,15 +4521,10 @@ void resctrl_offline_cpu(unsigned int cpu)
d = get_mon_domain_from_cpu(cpu, l3);
if (d) {
- if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
- cancel_delayed_work(&d->mbm_over);
- mbm_setup_overflow_handler(d, 0, cpu);
- }
- if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
- cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
- cancel_delayed_work(&d->cqm_limbo);
- cqm_setup_limbo_handler(d, 0, cpu);
- }
+ if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu)
+ d->mbm_work_cpu = nr_cpu_ids;
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && cpu == d->cqm_work_cpu)
+ d->cqm_work_cpu = nr_cpu_ids;
}
out_unlock:
--
2.54.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-26 15:32 ` Luck, Tony
@ 2026-05-26 17:53 ` Reinette Chatre
2026-05-26 18:27 ` Luck, Tony
0 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-26 17:53 UTC (permalink / raw)
To: Luck, Tony
Cc: james.morse, Dave.Martin, babu.moger, bp, tglx, dave.hansen, x86,
hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
Hi Tony,
On 5/26/26 8:32 AM, Luck, Tony wrote:
Instead of deleting the patch without any comments, could you please provide some
insight to what problems it has and why your solution is better?
> On Fri, May 22, 2026 at 12:15:13PM -0700, Reinette Chatre wrote:
>> - Adding a reference count to the domain structure to avoid the worker
>> needing to take CPU hotplug lock. This ended up being very complicated
>> with the architecture needing new APIs to manage the reference count
>> which cannot cleanly integrate into MPAM since it uses a single
>> architecture domain structure to contain both the control and monitoring
>> domain structures. Managing the references across mount, unmount,
>> online, offline, as well as worker self exit resulted in several
>> asymmetrical and complicated paths that were error prone. Locking also
>> proved to be complicated since architecture would need to initiate
>> domain free that will need to call back into resctrl that will take
>> rdtgroup_mutex which means that references need to be taken/released
>> without locking.
>
> I'd been working on a reference count approach too. The MPAM combined
> domain for control and monitoring doesn't seem insurmountable. Mostly
While technically possible I do not think it is a clean solution to have
the lifetime of the control domain be controlled with a reference in the
monitoring domain.
> because it seems unlikely that the problem with worker threads would
> ever apply to control domains. Maybe I missed something, but just adding
> an architecture *release() function that can be used by file system code
> to drop reference counts on the domain when worker threads exit seems
> enough.
Did you consider the locking implications that I mention in the description
you quoted? More below ...
>
> My patch below.
heh
>
> -Tony
>
>
> From 611fd8ad816abd37ef9a65b39175ce05907a1d41 Mon Sep 17 00:00:00 2001
> From: Tony Luck <tony.luck@intel.com>
> Date: Thu, 21 May 2026 15:14:27 -0700
> Subject: [PATCH] fs,mpam,x86/resctrl: Track reference count for L3 monitor
> domains
>
> There are race conditions[1] when the last CPU of a domain is taken offline
> and a worker thread may access the domain structure after it is freed.
>
> Add a rdt_l3_mon_domain::kref to track users of the domain. Don't try
> to cancel worker threads when CPUs are taken offline. Just set the
> target CPU for the thread to nr_cpu_ids to indicate the worker needs
> to take action next time it runs.
One test I have found to be useful when digging into this is to offline all CPUs
of a domain starting with lowest number. Since overflow worker runs on lowest number
and then is moved to next CPU when it goes offline this stresses this new mechanics.
Have you tried something similar or could you try this test with this solution?
...
> @@ -680,18 +692,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
>
> switch (r->rid) {
> case RDT_RESOURCE_L3: {
> - struct rdt_hw_l3_mon_domain *hw_dom;
> struct rdt_l3_mon_domain *d;
>
> if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> return;
>
> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> - hw_dom = resctrl_to_arch_mon_dom(d);
> resctrl_offline_mon_domain(r, hdr);
> list_del_rcu(&hdr->list);
> synchronize_rcu();
> - l3_mon_domain_free(hw_dom);
> + kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
> break;
> }
To me the idea behind a "domain reference count" is to provide guarantee to any holder of
a reference that the domain *and* its data remains accessible while it holds the reference.
There is the domain structure itself and then the architecture specific, for example,
rdt_hw_l3_mon_domain::arch_mbm_states, and the fs state, for example rdt_l3_mon_domain::mbm_states.
Above snippet moves the freeing of the *architecture* state to be called on kref_put()
while *always* freeing the fs state (resctrl_offline_mon_domain()->domain_destroy_l3_mon_state()).
A worker may thus have a reference to the domain but when it runs it runs without
fs state which is just a new use-after-free.
As I mentioned in the description the release managed by architecture implies that
reference needs to be dropped without rdtgroup_mutex held since the architecture
should also call resctrl_offline_mon_domain() as part of release. Locking needs
to be reworked and needs to adhere to kref rules on kref_get()/kref_put() without
locking.
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-26 17:53 ` Reinette Chatre
@ 2026-05-26 18:27 ` Luck, Tony
2026-05-26 21:05 ` Reinette Chatre
0 siblings, 1 reply; 40+ messages in thread
From: Luck, Tony @ 2026-05-26 18:27 UTC (permalink / raw)
To: Reinette Chatre
Cc: james.morse, Dave.Martin, babu.moger, bp, tglx, dave.hansen, x86,
hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
On Tue, May 26, 2026 at 10:53:59AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/26/26 8:32 AM, Luck, Tony wrote:
>
> Instead of deleting the patch without any comments, could you please provide some
> insight to what problems it has and why your solution is better?
>
> > On Fri, May 22, 2026 at 12:15:13PM -0700, Reinette Chatre wrote:
> >> - Adding a reference count to the domain structure to avoid the worker
> >> needing to take CPU hotplug lock. This ended up being very complicated
> >> with the architecture needing new APIs to manage the reference count
> >> which cannot cleanly integrate into MPAM since it uses a single
> >> architecture domain structure to contain both the control and monitoring
> >> domain structures. Managing the references across mount, unmount,
> >> online, offline, as well as worker self exit resulted in several
> >> asymmetrical and complicated paths that were error prone. Locking also
> >> proved to be complicated since architecture would need to initiate
> >> domain free that will need to call back into resctrl that will take
> >> rdtgroup_mutex which means that references need to be taken/released
> >> without locking.
> >
> > I'd been working on a reference count approach too. The MPAM combined
> > domain for control and monitoring doesn't seem insurmountable. Mostly
>
> While technically possible I do not think it is a clean solution to have
> the lifetime of the control domain be controlled with a reference in the
> monitoring domain.
>
> > because it seems unlikely that the problem with worker threads would
> > ever apply to control domains. Maybe I missed something, but just adding
> > an architecture *release() function that can be used by file system code
> > to drop reference counts on the domain when worker threads exit seems
> > enough.
>
> Did you consider the locking implications that I mention in the description
> you quoted? More below ...
I didn't run into any lockdep splats during testing. But maybe didn't
have enough code coverage to poke into corner cases.
>
> >
> > My patch below.
>
> heh
>
> >
> > -Tony
> >
> >
> > From 611fd8ad816abd37ef9a65b39175ce05907a1d41 Mon Sep 17 00:00:00 2001
> > From: Tony Luck <tony.luck@intel.com>
> > Date: Thu, 21 May 2026 15:14:27 -0700
> > Subject: [PATCH] fs,mpam,x86/resctrl: Track reference count for L3 monitor
> > domains
> >
> > There are race conditions[1] when the last CPU of a domain is taken offline
> > and a worker thread may access the domain structure after it is freed.
> >
> > Add a rdt_l3_mon_domain::kref to track users of the domain. Don't try
> > to cancel worker threads when CPUs are taken offline. Just set the
> > target CPU for the thread to nr_cpu_ids to indicate the worker needs
> > to take action next time it runs.
>
> One test I have found to be useful when digging into this is to offline all CPUs
> of a domain starting with lowest number. Since overflow worker runs on lowest number
> and then is moved to next CPU when it goes offline this stresses this new mechanics.
> Have you tried something similar or could you try this test with this solution?
Yes. My test case ran through all CPUs in the domain in order. That
showed an interesting artifact that when CPU 36 goes offline, the worker
next runs on CPU37, which is the place it needs to be, but it isn't
bound to CPU37. But since d->mbm_work_cpu (or d->cqm_work_cpu) is set
to nr_cpus_id the worker calls schedule_delayed_work_on() to get itself
properly bound to CPU37.
>
> ...
>
> > @@ -680,18 +692,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> >
> > switch (r->rid) {
> > case RDT_RESOURCE_L3: {
> > - struct rdt_hw_l3_mon_domain *hw_dom;
> > struct rdt_l3_mon_domain *d;
> >
> > if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > return;
> >
> > d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> > - hw_dom = resctrl_to_arch_mon_dom(d);
> > resctrl_offline_mon_domain(r, hdr);
> > list_del_rcu(&hdr->list);
> > synchronize_rcu();
> > - l3_mon_domain_free(hw_dom);
> > + kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
> > break;
> > }
>
> To me the idea behind a "domain reference count" is to provide guarantee to any holder of
> a reference that the domain *and* its data remains accessible while it holds the reference.
> There is the domain structure itself and then the architecture specific, for example,
> rdt_hw_l3_mon_domain::arch_mbm_states, and the fs state, for example rdt_l3_mon_domain::mbm_states.
Agreed. I'm not guaranteeing that the whole of the domain structure
(with all the substructures that it points to) are valid. This reference
only promises that the mbm_over, cqm_limbo, mbm_work_cpu, and cqm_work_cpu
fields are still valid. That's enough to solve this issue, but if we
were to adopt this patch would need some comments so that future readers
were not led astray.
> Above snippet moves the freeing of the *architecture* state to be called on kref_put()
> while *always* freeing the fs state (resctrl_offline_mon_domain()->domain_destroy_l3_mon_state()).
>
> A worker may thus have a reference to the domain but when it runs it runs without
> fs state which is just a new use-after-free.
For this specific case, the worker sees "nr_cpus_id" and avoids use of
other fields that may no longer be valid.
> As I mentioned in the description the release managed by architecture implies that
> reference needs to be dropped without rdtgroup_mutex held since the architecture
> should also call resctrl_offline_mon_domain() as part of release. Locking needs
> to be reworked and needs to adhere to kref rules on kref_get()/kref_put() without
> locking.
That's all handled in the existing (unchanged) offline path. The worker
threads only need to call kref_put() to release their hold on the domain
structure that has been mostly freed (and likely has no hold from the
file system by this point).
The release functions in both X86 and MPAM don't need locks, they are
just calling kfree()
>
> Reinette
I see that sashiko found no issues in parts 1-5 of your series (Yay!)
Grumbled about some issues in part 6. And then gave up before getting to
your latest solution to the domain offline problem.
Can you repost just part 9 (or maybe parts 7-9) on top of latest -rc to
see if sashiko is happy at last?
-Tony
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-26 18:27 ` Luck, Tony
@ 2026-05-26 21:05 ` Reinette Chatre
2026-05-26 21:26 ` Luck, Tony
0 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-26 21:05 UTC (permalink / raw)
To: Luck, Tony
Cc: james.morse, Dave.Martin, babu.moger, bp, tglx, dave.hansen, x86,
hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
Hi Tony,
On 5/26/26 11:27 AM, Luck, Tony wrote:
> On Tue, May 26, 2026 at 10:53:59AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/26/26 8:32 AM, Luck, Tony wrote:
>>
>> Instead of deleting the patch without any comments, could you please provide some
>> insight to what problems it has and why your solution is better?
>>
>>> On Fri, May 22, 2026 at 12:15:13PM -0700, Reinette Chatre wrote:
>>>> - Adding a reference count to the domain structure to avoid the worker
>>>> needing to take CPU hotplug lock. This ended up being very complicated
>>>> with the architecture needing new APIs to manage the reference count
>>>> which cannot cleanly integrate into MPAM since it uses a single
>>>> architecture domain structure to contain both the control and monitoring
>>>> domain structures. Managing the references across mount, unmount,
>>>> online, offline, as well as worker self exit resulted in several
>>>> asymmetrical and complicated paths that were error prone. Locking also
>>>> proved to be complicated since architecture would need to initiate
>>>> domain free that will need to call back into resctrl that will take
>>>> rdtgroup_mutex which means that references need to be taken/released
>>>> without locking.
>>>
>>> I'd been working on a reference count approach too. The MPAM combined
>>> domain for control and monitoring doesn't seem insurmountable. Mostly
>>
>> While technically possible I do not think it is a clean solution to have
>> the lifetime of the control domain be controlled with a reference in the
>> monitoring domain.
>>
>>> because it seems unlikely that the problem with worker threads would
>>> ever apply to control domains. Maybe I missed something, but just adding
>>> an architecture *release() function that can be used by file system code
>>> to drop reference counts on the domain when worker threads exit seems
>>> enough.
>>
>> Did you consider the locking implications that I mention in the description
>> you quoted? More below ...
>
> I didn't run into any lockdep splats during testing. But maybe didn't
> have enough code coverage to poke into corner cases.
Your patch avoids lockdep splats by using *resctrl fs* domain reference to protect
only the *architecture* domain data that can be removed without rdtgroup_mutex while
removing the resctrl fs domain data unconditionally. I do not find this to be a sane
usage of the reference count. Since the reference is attached to the resctrl fs domain
(rdt_l3_mon_domain::kref) then I find it reasonable to expect it to protect
struct rdt_l3_mon_domain's data. Could you please elaborate why you disagree?
>>
>>>
>>> My patch below.
>>
>> heh
>>
>>>
>>> -Tony
>>>
>>>
>>> From 611fd8ad816abd37ef9a65b39175ce05907a1d41 Mon Sep 17 00:00:00 2001
>>> From: Tony Luck <tony.luck@intel.com>
>>> Date: Thu, 21 May 2026 15:14:27 -0700
>>> Subject: [PATCH] fs,mpam,x86/resctrl: Track reference count for L3 monitor
>>> domains
>>>
>>> There are race conditions[1] when the last CPU of a domain is taken offline
>>> and a worker thread may access the domain structure after it is freed.
>>>
>>> Add a rdt_l3_mon_domain::kref to track users of the domain. Don't try
>>> to cancel worker threads when CPUs are taken offline. Just set the
>>> target CPU for the thread to nr_cpu_ids to indicate the worker needs
>>> to take action next time it runs.
>>
>> One test I have found to be useful when digging into this is to offline all CPUs
>> of a domain starting with lowest number. Since overflow worker runs on lowest number
>> and then is moved to next CPU when it goes offline this stresses this new mechanics.
>> Have you tried something similar or could you try this test with this solution?
>
> Yes. My test case ran through all CPUs in the domain in order. That
> showed an interesting artifact that when CPU 36 goes offline, the worker
> next runs on CPU37, which is the place it needs to be, but it isn't
> bound to CPU37. But since d->mbm_work_cpu (or d->cqm_work_cpu) is set
> to nr_cpus_id the worker calls schedule_delayed_work_on() to get itself
> properly bound to CPU37.
>>
>> ...
>>
>>> @@ -680,18 +692,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
>>>
>>> switch (r->rid) {
>>> case RDT_RESOURCE_L3: {
>>> - struct rdt_hw_l3_mon_domain *hw_dom;
>>> struct rdt_l3_mon_domain *d;
>>>
>>> if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
>>> return;
>>>
>>> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
>>> - hw_dom = resctrl_to_arch_mon_dom(d);
>>> resctrl_offline_mon_domain(r, hdr);
>>> list_del_rcu(&hdr->list);
>>> synchronize_rcu();
>>> - l3_mon_domain_free(hw_dom);
>>> + kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
>>> break;
>>> }
>>
>> To me the idea behind a "domain reference count" is to provide guarantee to any holder of
>> a reference that the domain *and* its data remains accessible while it holds the reference.
>> There is the domain structure itself and then the architecture specific, for example,
>> rdt_hw_l3_mon_domain::arch_mbm_states, and the fs state, for example rdt_l3_mon_domain::mbm_states.
>
> Agreed. I'm not guaranteeing that the whole of the domain structure
> (with all the substructures that it points to) are valid. This reference
> only promises that the mbm_over, cqm_limbo, mbm_work_cpu, and cqm_work_cpu
> fields are still valid. That's enough to solve this issue, but if we
> were to adopt this patch would need some comments so that future readers
> were not led astray.
Adding reference counting to the domain structure that does not actually protect the
domain structure but instead introduces additional corner cases created just to fix one
issue is not something I am comfortable with. I also do not see adding comments to
explain all the sharp corners as "fixing" it.
>
>> Above snippet moves the freeing of the *architecture* state to be called on kref_put()
>> while *always* freeing the fs state (resctrl_offline_mon_domain()->domain_destroy_l3_mon_state()).
>>
>> A worker may thus have a reference to the domain but when it runs it runs without
>> fs state which is just a new use-after-free.
>
> For this specific case, the worker sees "nr_cpus_id" and avoids use of
> other fields that may no longer be valid.
This does not sound like proper reference counting to me.
>
>> As I mentioned in the description the release managed by architecture implies that
>> reference needs to be dropped without rdtgroup_mutex held since the architecture
>> should also call resctrl_offline_mon_domain() as part of release. Locking needs
>> to be reworked and needs to adhere to kref rules on kref_get()/kref_put() without
>> locking.
>
> That's all handled in the existing (unchanged) offline path. The worker
> threads only need to call kref_put() to release their hold on the domain
> structure that has been mostly freed (and likely has no hold from the
> file system by this point).
>
> The release functions in both X86 and MPAM don't need locks, they are
> just calling kfree()
Right. Not needing locks is possible if the reference does not actually protect
domain state.
... and just ignore the design comments.
>>
>> Reinette
>
> I see that sashiko found no issues in parts 1-5 of your series (Yay!)
> Grumbled about some issues in part 6. And then gave up before getting to
> your latest solution to the domain offline problem.
>
> Can you repost just part 9 (or maybe parts 7-9) on top of latest -rc to
> see if sashiko is happy at last?
... and then will you take a look?
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
* RE: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-26 21:05 ` Reinette Chatre
@ 2026-05-26 21:26 ` Luck, Tony
2026-05-27 1:49 ` Reinette Chatre
0 siblings, 1 reply; 40+ messages in thread
From: Luck, Tony @ 2026-05-26 21:26 UTC (permalink / raw)
To: Chatre, Reinette
Cc: james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com,
bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com,
fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com,
Chen, Yu C, linux-kernel@vger.kernel.org, patches@lists.linux.dev
>> Can you repost just part 9 (or maybe parts 7-9) on top of latest -rc to
>> see if sashiko is happy at last?
>
> ... and then will you take a look?
Yes. Your approach is much cleaner ... I'm just worried that there will be yet another layer of the onion to peel when sashiko looks at it.
There should be a limit to how much effort we spend trying to fix a bug that's technically possible, but so far away from real usage models.
-Tony
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-26 21:26 ` Luck, Tony
@ 2026-05-27 1:49 ` Reinette Chatre
0 siblings, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-27 1:49 UTC (permalink / raw)
To: Luck, Tony
Cc: james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com,
bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com,
fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com,
Chen, Yu C, linux-kernel@vger.kernel.org, patches@lists.linux.dev
On 5/26/26 2:26 PM, Luck, Tony wrote:
>>> Can you repost just part 9 (or maybe parts 7-9) on top of latest -rc to
>>> see if sashiko is happy at last?
>>
>> ... and then will you take a look?
>
> Yes.
https://sashiko.dev/#/patchset/cover.1779834897.git.reinette.chatre%40intel.com
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 2/9] fs/resctrl: Free mon_data structures on rdt_get_tree() failure
2026-05-22 19:15 ` [PATCH v3 2/9] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Reinette Chatre
@ 2026-05-27 15:18 ` Ben Horgan
0 siblings, 0 replies; 40+ messages in thread
From: Ben Horgan @ 2026-05-27 15:18 UTC (permalink / raw)
To: Reinette Chatre, tony.luck, james.morse, Dave.Martin, babu.moger,
bp, tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Reinette,
On 5/22/26 20:15, Reinette Chatre wrote:
> From: Tony Luck <tony.luck@intel.com>
>
> If mkdir_mondata_all() or a subsequent call in rdt_get_tree() fails, the
> mon_data structures allocated by mon_get_kn_priv() are leaked.
>
> Add mon_put_kn_priv() to the out_mongrp error path to free the mon_data
> structures.
>
> Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing mon_data_bits")
> Reported-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Looks good to me.
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
> ---
> Changes since V2:
> - Reword changelog.
> ---
> fs/resctrl/rdtgroup.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 91922fe1ea08..f573db9e6e84 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3081,6 +3081,7 @@ static int rdt_get_tree(struct fs_context *fc)
> kernfs_remove(kn_mondata);
> out_mongrp:
> if (resctrl_arch_mon_capable()) {
> + mon_put_kn_priv();
> rdtgroup_unassign_cntrs(&rdtgroup_default);
> kernfs_remove(kn_mongrp);
> }
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount
2026-05-22 19:15 ` [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount Reinette Chatre
@ 2026-05-28 9:45 ` Ben Horgan
2026-05-28 16:09 ` Reinette Chatre
2026-05-28 13:48 ` Chen Yu
1 sibling, 1 reply; 40+ messages in thread
From: Ben Horgan @ 2026-05-28 9:45 UTC (permalink / raw)
To: Reinette Chatre, tony.luck, james.morse, Dave.Martin, babu.moger,
bp, tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Reinette,
On 5/22/26 20:15, Reinette Chatre wrote:
> From: Tony Luck <tony.luck@intel.com>
>
> During unmount or failure teardown all mon_data structures that contain
> monitoring event file private data are freed after which kernfs nodes are
> removed. However, the RDT_DELETED flag is never set for the statically
> allocated default resource group.
>
> A concurrent reader of an event file associated with the default resource
> group may, after dropping kernfs active protection, block on rdtgroup_mutex
> while unmount proceeds to free the file private data and destroy the kernfs
> node without waiting for the reader.
>
> When the mutex is released, the reader wakes up, observes that RDT_DELETED
> is not set for the default group, and dereferences the already-freed
> file private data.
>
> Set RDT_DELETED for the default group unconditionally since the flag does
> not lead to free of this statically allocated group.
>
> Do not allow a new resctrl mount if there are any waiters on default group
> of previous mount. A new mount will re-initialize the default group that
> would appear to waiters from previous mount as though the default group is
> accessible causing them to access the mon_data structures from the previous
> mount that have been removed.
>
> Fixes: 2a6566038544 ("x86/resctrl: Expand the width of domid by replacing mon_data_bits")
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260508182143.14592-1-tony.luck%40intel.com?part=2 [1]
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> Changes since V2:
> - Rewrite changelog to not describe code as much.
> - Rework changelog to switch to "Reported-by/Closes".
> - Merge the duplicate rdtgroup_remove() comment with the function comment.
> - Fix changelog to not mention that RDT_DELETED flag is set conditionally.
> - Change "Fixes:" tag to point to commit that introduced dynamically
> allocated mon_data this bug involves.
> ---
> fs/resctrl/rdtgroup.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index f573db9e6e84..8a1457825919 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -585,14 +585,20 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
> *
> * On resource group creation via a mkdir, an extra kernfs_node reference is
> * taken to ensure that the rdtgroup structure remains accessible for the
> - * rdtgroup_kn_unlock() calls where it is removed.
> + * rdtgroup_kn_unlock() calls where it is removed. The default group is
> + * statically allocated: it does not have an extra reference but will have
> + * RDT_DELETED set on unmount to support safe access to its associated files
> + * via rdtgroup_kn_lock_live/rdtgroup_kn_unlock().
> *
> - * Drop the extra reference here, then free the rdtgroup structure.
> + * For all but the default group: drop the extra reference, then free the
> + * rdtgroup structure.
> *
> * Return: void
> */
> static void rdtgroup_remove(struct rdtgroup *rdtgrp)
> {
> + if (rdtgrp == &rdtgroup_default)
> + return;
> kernfs_put(rdtgrp->kn);
> kfree(rdtgrp);
> }
> @@ -2975,6 +2981,7 @@ static void resctrl_fs_teardown(void)
> mon_put_kn_priv();
> rdt_pseudo_lock_release();
> rdtgroup_default.mode = RDT_MODE_SHAREABLE;
> + rdtgroup_default.flags = RDT_DELETED;
Would it be better to use |= here?
I expect it's not functionally different but I'm unsure of the interaction with
RDT_DELETED_PLR and it seems best in general unless clearing other possible
flags is intended.
Thanks,
Ben
> closid_exit();
> schemata_list_destroy();
> rdtgroup_destroy_root();
> @@ -3000,6 +3007,12 @@ static int rdt_get_tree(struct fs_context *fc)
> goto out;
> }
>
> + /* Avoid races from pending operations from a previous mount */
> + if (atomic_read(&rdtgroup_default.waitcount) != 0) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> ret = setup_rmid_lru_list();
> if (ret)
> goto out;
> @@ -4275,6 +4288,7 @@ static int rdtgroup_setup_root(struct rdt_fs_context *ctx)
>
> ctx->kfc.root = rdt_root;
> rdtgroup_default.kn = kernfs_root_to_node(rdt_root);
> + rdtgroup_default.flags = 0;
>
> return 0;
> }
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 1/9] fs/resctrl: Move functions to avoid forward references in subsequent fixes
2026-05-22 19:15 ` [PATCH v3 1/9] fs/resctrl: Move functions to avoid forward references in subsequent fixes Reinette Chatre
@ 2026-05-28 10:06 ` Ben Horgan
0 siblings, 0 replies; 40+ messages in thread
From: Ben Horgan @ 2026-05-28 10:06 UTC (permalink / raw)
To: Reinette Chatre, tony.luck, james.morse, Dave.Martin, babu.moger,
bp, tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Reinette,
On 5/22/26 20:15, Reinette Chatre wrote:
> From: Tony Luck <tony.luck@intel.com>
>
> rdt_get_tree() manages resctrl fs mount and rdt_kill_sb() manages resctrl
> fs unmount.
>
> There is significant overlap between error cleanup during resctrl mount
> failure and cleanup on resctrl unmount yet the cleanup is not done
> consistently in these two flows.
>
> Pull some cleanup functions before rdt_get_tree() in preparation for
> a new helper that can be shared between mount and unmount.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Thanks,
Ben
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount
2026-05-22 19:15 ` [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount Reinette Chatre
@ 2026-05-28 10:11 ` Ben Horgan
2026-05-29 14:06 ` Chen, Yu C
1 sibling, 0 replies; 40+ messages in thread
From: Ben Horgan @ 2026-05-28 10:11 UTC (permalink / raw)
To: Reinette Chatre, tony.luck, james.morse, Dave.Martin, babu.moger,
bp, tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Reinette,
On 5/22/26 20:15, Reinette Chatre wrote:
> rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). If
> superblock setup fails inside kernfs_get_tree(), the VFS calls kill_sb on
> the same thread before the call returns. rdt_kill_sb() unconditionally
> attempts to acquire rdtgroup_mutex and deadlock occurs.
>
> Move the call to kernfs_get_tree() outside of locks. If kernfs_get_tree()
> fails and ctx->kfc.new_sb_created is set, then rdt_kill_sb() has already
> been called and no further cleanup is needed.
>
> Add an extra hold in this error path on rdtgroup_default.kn to defend against
> other races destroying the root which is then dereferenced in kernfs_kill_sb()
>
> Add resctrl_unmount() helper to keep code consistent between the
> rdt_get_tree() failure path and a normal unmount.
>
> Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support")
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
> Co-developed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Looks good to me.
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Thanks,
Ben
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 5/9] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put()
2026-05-22 19:15 ` [PATCH v3 5/9] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put() Reinette Chatre
@ 2026-05-28 10:51 ` Ben Horgan
0 siblings, 0 replies; 40+ messages in thread
From: Ben Horgan @ 2026-05-28 10:51 UTC (permalink / raw)
To: Reinette Chatre, tony.luck, james.morse, Dave.Martin, babu.moger,
bp, tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Reinette,
On 5/22/26 20:15, Reinette Chatre wrote:
> A struct rdtgroup is reference counted via rdtgroup::waitcount. Callers
> that need the structure to remain valid across a sleep (while waiting on
> acquiring rdtgroup_mutex) take a reference with rdtgroup_kn_get() and
> release it with rdtgroup_kn_put(). The release path is intended to serve
> as the fallback freer: if the count drops to zero and the group has
> already been marked RDT_DELETED, rdtgroup_kn_put() frees the structure.
> The bulk teardown paths free_all_child_rdtgrp() and rmdir_all_sub()
> resulting from a resctrl directory remove or resctrl fs unmount act as
> the primary freer: they hold rdtgroup_mutex and free each rdtgroup whose
> waitcount is zero, otherwise they set RDT_DELETED and leave the freeing
> to the last waiter.
>
> These two freers race. rdtgroup_kn_put() commits waitcount == 0 with
> atomic_dec_and_test() outside rdtgroup_mutex, then reads rdtgroup::flags.
> Between those two operations a concurrent caller of free_all_child_rdtgrp()
> or rmdir_all_sub() (which holds the mutex) can observe waitcount == 0 via
> atomic_read(), call rdtgroup_remove(), and kfree() the structure. The
> subsequent read of rdtgroup::flags in rdtgroup_kn_put() is then a
> use-after-free, and the structure may even be freed twice if the freed
> memory happens to satisfy the RDT_DELETED flag check.
>
> Replace the bare atomic_dec_and_test() with atomic_dec_and_mutex_lock()
> so that the decrement-to-zero takes rdtgroup_mutex before the count
> becomes globally visible. The inspection of rdtgroup::flags then runs
> under the same mutex held by the bulk freers, making the two paths
> mutually exclusive. The common case where the count does not reach
> zero remains lock-free. Defer kernfs_unbreak_active_protection() until
> after the mutex is dropped since kernfs active protections functionally
> wrap rdtgroup_mutex. Remove resource group, which in turn drops its kernfs
> reference, after kernfs protection is restored.
>
> Fixes: b8511ccc75c0 ("x86/resctrl: Fix use-after-free when deleting resource groups")
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=1
> Assisted-by: GitHub_Copilot:gemini-3.1-pro
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Seems good to me.
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
> ---
> Changes since V2:
> - New patch
> ---
> fs/resctrl/rdtgroup.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index d323b515060b..6418395877cf 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -2606,15 +2606,24 @@ static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
>
> static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
> {
> - if (atomic_dec_and_test(&rdtgrp->waitcount) &&
> - (rdtgrp->flags & RDT_DELETED)) {
> + bool needs_free;
> +
> + if (!atomic_dec_and_mutex_lock(&rdtgrp->waitcount, &rdtgroup_mutex)) {
> + kernfs_unbreak_active_protection(kn);
> + return;
> + }
> +
> + needs_free = rdtgrp->flags & RDT_DELETED;
> +
> + mutex_unlock(&rdtgroup_mutex);
> +
> + kernfs_unbreak_active_protection(kn);
> +
> + if (needs_free) {
> if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
> rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
> rdtgroup_pseudo_lock_remove(rdtgrp);
> - kernfs_unbreak_active_protection(kn);
> rdtgroup_remove(rdtgrp);
> - } else {
> - kernfs_unbreak_active_protection(kn);
> }
> }
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 6/9] fs/resctrl: Fix pseudo-locking lifetime handling
2026-05-22 19:15 ` [PATCH v3 6/9] fs/resctrl: Fix pseudo-locking lifetime handling Reinette Chatre
@ 2026-05-28 10:56 ` Ben Horgan
2026-05-28 16:10 ` Reinette Chatre
0 siblings, 1 reply; 40+ messages in thread
From: Ben Horgan @ 2026-05-28 10:56 UTC (permalink / raw)
To: Reinette Chatre, tony.luck, james.morse, Dave.Martin, babu.moger,
bp, tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Reinette,
On 5/22/26 20:15, Reinette Chatre wrote:
> rmdir_all_sub() implements resctrl unmount teardown by iterating through
> all resource groups. Memory cleanup is coordinated by deferring resource
> group removal to rdtgroup_kn_put() if there are lingering waiters.
>
> A couple of issues exist in the pseudo-locking lifetime management:
>
> 1) rmdir_all_sub() unconditionally invokes rdtgroup_pseudo_lock_remove() on
> pseudo-locked groups, even if active references remain. When the deferred
> rdtgroup_kn_put() is finally executed by the last waiter, it erroneously
> calls rdtgroup_pseudo_lock_remove() a second time, resulting in a
> NULL-pointer dereference as the pseudo-lock structures have already been
> freed.
>
> 2) pseudo_lock_dev_release() drops the waitcount reference but misses
> checking if the resource group has been flagged for deletion. If
> resctrl is unmounted while a pseudo-lock device file is open,
> rmdir_all_sub() will observe the active waitcount and defer the deletion
> to the last reference holder. When the device file descriptor is
> subsequently closed, pseudo_lock_dev_release() decrements the
> waitcount without inspecting nor acting on the RDT_DELETED flag.
> Neither the pseudo-lock memory cleanup nor the final resource group
> removal are executed.
>
> 3) rdtgroup_pseudo_lock_remove() calls debugfs_remove_recursive() that
> will wait for any existing debugfs users to complete before it can
> do cleanup. Since the pseudo-locking debugfs handlers take
> rdtgroup_mutex it is required that this debugfs_remove_recursive()
> is not called with rdtgroup_mutex held, yet rmdir_all_sub() does so.
>
> 4) A pseudo-locked group's RMID is freed when it is created. On unmount
> rmdir_all_sub() unconditionally frees all RMID of all groups, resulting
> in a double-free of the pseudo-locked group's RMID. The additional
> consequence of this is that the original free results in the
> pseudo-locked group's RMID being added to the rmid_free_lru linked
> list and the second free then attempts to add the same RMID entry to
> the rmid_free_lru again.
>
> Fix pseudo-locking lifetime handling by separating the pseudo-locking
> infrastructure removal from the pseudo-locking region memory teardown, and
> splitting pseudo-locking infrastructure removal into what needs rdtgroup_mutex
> protection and what does not. With this separation an unmount of resctrl
> fs can proceed to remove the pseudo-locking infrastructure of a pseudo-locked
> region since the global infrastructure it depends on will be removed soon
> after (resctrl_unmount()-> resctrl_fs_teardown()->rdt_pseudo_lock_release()).
> Any active users of the pseudo-locked region via an open file can complete
> safely afterwards and only need to do the pseudo-locked region specific
> memory teardown.
>
> The new RDT_DELETED_PLR resource group flag communicates to waiters whether
> the pseudo-locked region's infrastructure has already been removed.
>
> Fixes: 746e08590b86 ("x86/intel_rdt: Create character device exposing pseudo-locked region")
> Fixes: e0bdfe8e36f3 ("x86/intel_rdt: Support creation/removal of pseudo-locked region")
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=3
> Closes: https://sashiko.dev/#/patchset/20260515193944.15114-1-tony.luck%40intel.com?part=1
> Assisted-by: GitHub_Copilot:gemini-3.1-pro
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> Changes since V2:
> - New patch
> ---
> fs/resctrl/internal.h | 12 +++++++++++
> fs/resctrl/pseudo_lock.c | 44 +++++++++++++++++++++++++++++++++-------
> fs/resctrl/rdtgroup.c | 35 ++++++++++++++++++--------------
> 3 files changed, 69 insertions(+), 22 deletions(-)
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 48af75b9dc85..e7e415ee7766 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -234,6 +234,15 @@ struct rdtgroup {
>
> /* rdtgroup.flags */
> #define RDT_DELETED 1
> +/*
> + * RDT_DELETED_PLR is set when the pseudo-locked group's infrastructure
> + * (its associated device, debugfs files, etc.) has been deleted via
> + * rdtgroup_pseudo_lock_remove(). This can be done while there are
> + * references to the pseudo-locked region since the pseudo-locked region
> + * self is freed separately via pseudo_lock_free() after there are no more
> + * references.
> + */
> +#define RDT_DELETED_PLR 2
I haven't had a proper look at this patch but there are a few places where
'flags = RDT_DELETED' is used rather than 'flags |= RDT_DELETED'. Are these all
fine?
Thanks,
Ben
>
> /* rftype.flags */
> #define RFTYPE_FLAGS_CPUS_LIST 1
> @@ -334,6 +343,7 @@ void rdt_last_cmd_puts(const char *s);
> __printf(1, 2)
> void rdt_last_cmd_printf(const char *fmt, ...);
>
> +void rdtgroup_remove(struct rdtgroup *rdtgrp);
> struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
>
> void rdtgroup_kn_unlock(struct kernfs_node *kn);
> @@ -484,6 +494,7 @@ void rdt_pseudo_lock_release(void);
> int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
>
> void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
> +void pseudo_lock_free(struct rdtgroup *rdtgrp);
>
> #else
> static inline int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp)
> @@ -514,6 +525,7 @@ static inline int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp)
> }
>
> static inline void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) { }
> +static inline void pseudo_lock_free(struct rdtgroup *rdtgrp) { }
> #endif /* CONFIG_RESCTRL_FS_PSEUDO_LOCK */
>
> #endif /* _FS_RESCTRL_INTERNAL_H */
> diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c
> index d1cb0986006e..f9d180eb699e 100644
> --- a/fs/resctrl/pseudo_lock.c
> +++ b/fs/resctrl/pseudo_lock.c
> @@ -333,8 +333,10 @@ static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr)
> *
> * Return: void
> */
> -static void pseudo_lock_free(struct rdtgroup *rdtgrp)
> +void pseudo_lock_free(struct rdtgroup *rdtgrp)
> {
> + if (!rdtgrp->plr)
> + return;
> pseudo_lock_region_clear(rdtgrp->plr);
> kfree(rdtgrp->plr);
> rdtgrp->plr = NULL;
> @@ -928,22 +930,37 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp)
> {
> struct pseudo_lock_region *plr = rdtgrp->plr;
>
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + if (rdtgrp->mode != RDT_MODE_PSEUDO_LOCKSETUP &&
> + rdtgrp->mode != RDT_MODE_PSEUDO_LOCKED)
> + return;
> +
> + if (rdtgrp->flags & RDT_DELETED_PLR)
> + return;
> +
> + rdtgrp->flags |= RDT_DELETED_PLR;
> +
> if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
> /*
> * Default group cannot be a pseudo-locked region so we can
> * free closid here.
> */
> closid_free(rdtgrp->closid);
> - goto free;
> + return;
> }
>
> pseudo_lock_cstates_relax(plr);
> - debugfs_remove_recursive(rdtgrp->plr->debugfs_dir);
> + /*
> + * Drop rdtgroup_mutex to enable debugfs_remove_recursive() to
> + * complete as it waits for active users that may be blocked
> + * waiting on rdtgroup_mutex to complete.
> + */
> + mutex_unlock(&rdtgroup_mutex);
> + debugfs_remove_recursive(plr->debugfs_dir);
> + mutex_lock(&rdtgroup_mutex);
> device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor));
> pseudo_lock_minor_release(plr->minor);
> -
> -free:
> - pseudo_lock_free(rdtgrp);
> }
>
> static int pseudo_lock_dev_open(struct inode *inode, struct file *filp)
> @@ -971,6 +988,7 @@ static int pseudo_lock_dev_open(struct inode *inode, struct file *filp)
> static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
> {
> struct rdtgroup *rdtgrp;
> + bool needs_free = false;
>
> mutex_lock(&rdtgroup_mutex);
> rdtgrp = filp->private_data;
> @@ -980,8 +998,20 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
> return -ENODEV;
> }
> filp->private_data = NULL;
> - atomic_dec(&rdtgrp->waitcount);
> +
> + if (atomic_dec_and_test(&rdtgrp->waitcount) &&
> + (rdtgrp->flags & RDT_DELETED)) {
> + needs_free = true;
> + rdtgroup_pseudo_lock_remove(rdtgrp);
> + }
> +
> mutex_unlock(&rdtgroup_mutex);
> +
> + if (needs_free) {
> + pseudo_lock_free(rdtgrp);
> + rdtgroup_remove(rdtgrp);
> + }
> +
> return 0;
> }
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 6418395877cf..a8b4ac7dd823 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -595,7 +595,7 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
> *
> * Return: void
> */
> -static void rdtgroup_remove(struct rdtgroup *rdtgrp)
> +void rdtgroup_remove(struct rdtgroup *rdtgrp)
> {
> if (rdtgrp == &rdtgroup_default)
> return;
> @@ -2606,23 +2606,24 @@ static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
>
> static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
> {
> - bool needs_free;
> + bool needs_free = false;
>
> if (!atomic_dec_and_mutex_lock(&rdtgrp->waitcount, &rdtgroup_mutex)) {
> kernfs_unbreak_active_protection(kn);
> return;
> }
>
> - needs_free = rdtgrp->flags & RDT_DELETED;
> + if (rdtgrp->flags & RDT_DELETED) {
> + needs_free = true;
> + rdtgroup_pseudo_lock_remove(rdtgrp);
> + }
>
> mutex_unlock(&rdtgroup_mutex);
>
> kernfs_unbreak_active_protection(kn);
>
> if (needs_free) {
> - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
> - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
> - rdtgroup_pseudo_lock_remove(rdtgrp);
> + pseudo_lock_free(rdtgrp);
> rdtgroup_remove(rdtgrp);
> }
> }
> @@ -2885,10 +2886,6 @@ static void rmdir_all_sub(void)
> if (rdtgrp == &rdtgroup_default)
> continue;
>
> - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
> - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
> - rdtgroup_pseudo_lock_remove(rdtgrp);
> -
> /*
> * Give any CPUs back to the default group. We cannot copy
> * cpu_online_mask because a CPU might have executed the
> @@ -2899,15 +2896,23 @@ static void rmdir_all_sub(void)
>
> rdtgroup_unassign_cntrs(rdtgrp);
>
> - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
> -
> kernfs_remove(rdtgrp->kn);
> list_del(&rdtgrp->rdtgroup_list);
>
> - if (atomic_read(&rdtgrp->waitcount) != 0)
> - rdtgrp->flags = RDT_DELETED;
> - else
> + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
> + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
> + rdtgroup_pseudo_lock_remove(rdtgrp);
> + } else {
> + /* Pseudo-locked group's RMID is freed during setup. */
> + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
> + }
> +
> + if (atomic_read(&rdtgrp->waitcount) != 0) {
> + rdtgrp->flags |= RDT_DELETED;
> + } else {
> + pseudo_lock_free(rdtgrp);
> rdtgroup_remove(rdtgrp);
> + }
> }
> /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
> update_closid_rmid(cpu_online_mask, &rdtgroup_default);
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount
2026-05-22 19:15 ` [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount Reinette Chatre
2026-05-28 9:45 ` Ben Horgan
@ 2026-05-28 13:48 ` Chen Yu
2026-05-28 16:09 ` Reinette Chatre
1 sibling, 1 reply; 40+ messages in thread
From: Chen Yu @ 2026-05-28 13:48 UTC (permalink / raw)
To: reinette.chatre
Cc: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen, x86, hpa, ben.horgan, fustini, fenghuay, peternewman,
linux-kernel, patches, Chen Yu
On Fri, 22 May 2026 12:15:07 -0700, Reinette Chatre wrote:
> When the mutex is released, the reader wakes up, observes that RDT_DELETED
> is not set for the default group, and dereferences the already-freed
> file private data.
I wonder if a code call sequence could be added in the commit log,
which would be helpful to quickly understand the race condition:
CPU 0 (read mon_data file) CPU 1 (umount)
-------------------------- --------------
rdtgroup_kn_lock_live()
rdtgrp = &rdtgroup_default
waitcount++
break_active_protection()
mutex_lock()
mon_put_kn_priv()
kfree(mon_data)
mutex_unlock()
mutex_lock()
rdtgrp->flags & RDT_DELETED?
// no -- never set for default
return rdtgrp
md = of->kn->priv // UAF: freed
I did not find issues in current implementation,
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount
2026-05-28 9:45 ` Ben Horgan
@ 2026-05-28 16:09 ` Reinette Chatre
0 siblings, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-28 16:09 UTC (permalink / raw)
To: Ben Horgan, tony.luck, james.morse, Dave.Martin, babu.moger, bp,
tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Ben,
On 5/28/26 2:45 AM, Ben Horgan wrote:
> On 5/22/26 20:15, Reinette Chatre wrote:
..
>> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
>> index f573db9e6e84..8a1457825919 100644
>> --- a/fs/resctrl/rdtgroup.c
>> +++ b/fs/resctrl/rdtgroup.c
>> @@ -585,14 +585,20 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
>> *
>> * On resource group creation via a mkdir, an extra kernfs_node reference is
>> * taken to ensure that the rdtgroup structure remains accessible for the
>> - * rdtgroup_kn_unlock() calls where it is removed.
>> + * rdtgroup_kn_unlock() calls where it is removed. The default group is
>> + * statically allocated: it does not have an extra reference but will have
>> + * RDT_DELETED set on unmount to support safe access to its associated files
>> + * via rdtgroup_kn_lock_live/rdtgroup_kn_unlock().
>> *
>> - * Drop the extra reference here, then free the rdtgroup structure.
>> + * For all but the default group: drop the extra reference, then free the
>> + * rdtgroup structure.
>> *
>> * Return: void
>> */
>> static void rdtgroup_remove(struct rdtgroup *rdtgrp)
>> {
>> + if (rdtgrp == &rdtgroup_default)
>> + return;
>> kernfs_put(rdtgrp->kn);
>> kfree(rdtgrp);
>> }
>> @@ -2975,6 +2981,7 @@ static void resctrl_fs_teardown(void)
>> mon_put_kn_priv();
>> rdt_pseudo_lock_release();
>> rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>> + rdtgroup_default.flags = RDT_DELETED;
>
> Would it be better to use |= here?
> I expect it's not functionally different but I'm unsure of the interaction with
> RDT_DELETED_PLR and it seems best in general unless clearing other possible
> flags is intended.
Thank you very much for taking a look and for the other tags.
I see this initialization as assigning a "reset" value that reflects the
cleanup done by resctrl_fs_teardown() as opposed to recording/adding another state
during runtime.
If resctrl does add RDT_DELETED_PLR then there is no interaction to be
concerned about here since pseudo-locking is not supported on the default
group. You can confirm this by considering the first "if (rdtgrp == &rdtgroup_default)"
check within fs/resctrl/pseudo_lock.c:rdtgroup_locksetup_enter().
Since there is only one flag, RDT_DELETED, possible for the default group
you are correct that using |= is not functionally different but my preference
would be to use assignment in this spot just to make it clear that this is
a reset value.
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount
2026-05-28 13:48 ` Chen Yu
@ 2026-05-28 16:09 ` Reinette Chatre
0 siblings, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-28 16:09 UTC (permalink / raw)
To: Chen Yu
Cc: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen, x86, hpa, ben.horgan, fustini, fenghuay, peternewman,
linux-kernel, patches
Hi Chenyu,
On 5/28/26 6:48 AM, Chen Yu wrote:
> On Fri, 22 May 2026 12:15:07 -0700, Reinette Chatre wrote:
>> When the mutex is released, the reader wakes up, observes that RDT_DELETED
>> is not set for the default group, and dereferences the already-freed
>> file private data.
>
> I wonder if a code call sequence could be added in the commit log,
> which would be helpful to quickly understand the race condition:
>
> CPU 0 (read mon_data file) CPU 1 (umount)
> -------------------------- --------------
>
> rdtgroup_kn_lock_live()
> rdtgrp = &rdtgroup_default
> waitcount++
> break_active_protection()
> mutex_lock()
> mon_put_kn_priv()
> kfree(mon_data)
> mutex_unlock()
> mutex_lock()
> rdtgrp->flags & RDT_DELETED?
> // no -- never set for default
> return rdtgrp
>
> md = of->kn->priv // UAF: freed
>
Yes, something like this will make the issue easier to see. Will do.
>
> I did not find issues in current implementation,
>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Thank you very much.
I just realized that I omitted your other tag from patch #2 [1]. I am
very sorry and will fix that it next posting.
Reinette
[1] https://lore.kernel.org/lkml/7f7712d9-2b23-412a-a3a8-e36c22b4ba32@intel.com/
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 6/9] fs/resctrl: Fix pseudo-locking lifetime handling
2026-05-28 10:56 ` Ben Horgan
@ 2026-05-28 16:10 ` Reinette Chatre
0 siblings, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-28 16:10 UTC (permalink / raw)
To: Ben Horgan, tony.luck, james.morse, Dave.Martin, babu.moger, bp,
tglx, dave.hansen
Cc: x86, hpa, fustini, fenghuay, peternewman, yu.c.chen, linux-kernel,
patches
Hi Ben,
On 5/28/26 3:56 AM, Ben Horgan wrote:
> On 5/22/26 20:15, Reinette Chatre wrote:
...
>> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
>> index 48af75b9dc85..e7e415ee7766 100644
>> --- a/fs/resctrl/internal.h
>> +++ b/fs/resctrl/internal.h
>> @@ -234,6 +234,15 @@ struct rdtgroup {
>>
>> /* rdtgroup.flags */
>> #define RDT_DELETED 1
>> +/*
>> + * RDT_DELETED_PLR is set when the pseudo-locked group's infrastructure
>> + * (its associated device, debugfs files, etc.) has been deleted via
>> + * rdtgroup_pseudo_lock_remove(). This can be done while there are
>> + * references to the pseudo-locked region since the pseudo-locked region
>> + * self is freed separately via pseudo_lock_free() after there are no more
>> + * references.
>> + */
>> +#define RDT_DELETED_PLR 2
>
> I haven't had a proper look at this patch but there are a few places where
> 'flags = RDT_DELETED' is used rather than 'flags |= RDT_DELETED'. Are these all
> fine?
These "deleted" flags are not independent. The pseudo-locking infrastructure can/should
only be deleted if the resource group has already been deleted or is about to be
deleted in the same flow. This relationship is clear when looking at the RDT_DELETED_PLR
assignment in pseudo_lock_dev_release() and rdtgroup_kn_put() where RDT_DELETED_PLR
setting depends on RDT_DELETED already being set. There is one place in rmdir_all_sub()
where the order is swapped with pseudo-locking infrastructure is torn down first,
thus setting RDT_DELETED_PLR _before_ RDT_DELETED. This is why rmdir_all_sub() uses
flags |= RDT_DELETED instead of assignment.
Even so, the issue [1] reported by Sashiko is real and how to fix that one is
not obvious to me at this time. These issues uncovered so far are races that can be triggered
by user space stress usage of the pseudo-locking files that is only possible on some
very specific ten-year old hardware. Such usage is contrary to the pseudo-locking usage model
so I am currently considering pulling this fix from the series. To add to this Sashiko reported
another [2] issue in this area while reviewing the resubmission aimed to get review feedback
on the last three patches.
Reinette
[1] https://sashiko.dev/#/patchset/cover.1779476724.git.reinette.chatre%40intel.com?part=6
[2] https://sashiko.dev/#/patchset/cover.1779834897.git.reinette.chatre%40intel.com?part=2
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list
2026-05-22 19:15 ` [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list Reinette Chatre
@ 2026-05-28 16:11 ` Reinette Chatre
2026-05-28 19:04 ` Babu Moger
2026-05-31 8:37 ` Chen, Yu C
0 siblings, 2 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-28 16:11 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
On 5/22/26 12:15 PM, Reinette Chatre wrote:
> static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
> @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
> return;
> }
>
> - list_add_tail_rcu(&d->hdr.list, add_pos);
> -
> err = resctrl_online_mon_domain(r, &d->hdr);
> if (err) {
> - list_del_rcu(&d->hdr.list);
> - synchronize_rcu();
> l3_mon_domain_free(hw_dom);
> + return;
> }
> + list_add_tail_rcu(&d->hdr.list, add_pos);
> }
>
> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
I resubmitted the last three patches of series to obtain Sashiko review [1] and
respond to that feedback here:
a) Sashiko: "Does this reordering expose the monitor directories to userspace before the
domain is actually added to the RCU list?"
Yes. As pointed out by Sashiko there is a short time where the monitoring data files
may be exposed to user space before the monitoring domain is added to the RCU list.
Also pointed out by Sashiko, if user attempts to read from such file it will return
-ENOENT.
This behavior looks correct and acceptable to me.
b) Sashiko: "This is a pre-existing issue, but does resctrl_find_domain() safely traverse
the RCU list?
Yes, this is safe because resctrl_find_domain() is run with cpus_read_lock() that ensures
the list can be traversed safely because the list can only be modified with
CPU write lock held.
One improvement to resctrl that would help support this is to replace the "list_for_each()"
domain list traversals with something like:
list_for_each_entry_rcu(pos, head, member, lockdep_is_cpus_held())
Doing something like above would help document why such list traversal outside of
RCU read-side critical section is safe.
Reinette
[1] https://sashiko.dev/#/patchset/cover.1779834897.git.reinette.chatre%40intel.com
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed
2026-05-22 19:15 ` [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed Reinette Chatre
2026-05-26 15:32 ` Luck, Tony
@ 2026-05-28 16:12 ` Reinette Chatre
1 sibling, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-28 16:12 UTC (permalink / raw)
To: tony.luck, james.morse, Dave.Martin, babu.moger, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
On 5/22/26 12:15 PM, Reinette Chatre wrote:
> @@ -841,7 +860,10 @@ void mbm_handle_overflow(struct work_struct *work)
> struct list_head *head;
> struct rdt_resource *r;
>
> - cpus_read_lock();
> + /*
> + * Safe to run without CPU hotplug lock. Work is guaranteed to be
> + * canceled before the domain structure is removed.
> + */
> mutex_lock(&rdtgroup_mutex);
>
> /*
I resubmitted the last three patches of series to obtain Sashiko review [1] and
respond to that feedback here:
Sashiko: "Could running this worker without the hotplug lock trigger lockdep splats
and expose the architecture backend to races against CPU hotplug?"
No.
Sashiko points out that the MPAM resctrl_arch_get_config() and resctrl_arch_update_one()
hooks contain lockdep_assert_cpus_held() and since these hooks are called via mbm_handle_overflow()
it seems that this could trigger lockdep splats.
Recent commit f52abe650241 ("fs/resctrl: Disallow the software controller when MBM counters are assignable")
established that MPAM does not support the software controller and these hooks will
thus not be called from overflow handler.
Reinette
[1] https://sashiko.dev/#/patchset/cover.1779834897.git.reinette.chatre%40intel.com
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list
2026-05-28 16:11 ` Reinette Chatre
@ 2026-05-28 19:04 ` Babu Moger
2026-05-28 20:56 ` Reinette Chatre
2026-05-31 8:37 ` Chen, Yu C
1 sibling, 1 reply; 40+ messages in thread
From: Babu Moger @ 2026-05-28 19:04 UTC (permalink / raw)
To: Reinette Chatre, tony.luck, james.morse, Dave.Martin, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
Hi Reinette,
On 5/28/26 11:11, Reinette Chatre wrote:
>
>
> On 5/22/26 12:15 PM, Reinette Chatre wrote:
>> static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
>> @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>> return;
>> }
>>
>> - list_add_tail_rcu(&d->hdr.list, add_pos);
>> -
>> err = resctrl_online_mon_domain(r, &d->hdr);
>> if (err) {
>> - list_del_rcu(&d->hdr.list);
>> - synchronize_rcu();
>> l3_mon_domain_free(hw_dom);
>> + return;
>> }
>> + list_add_tail_rcu(&d->hdr.list, add_pos);
>> }
>>
>> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>
> I resubmitted the last three patches of series to obtain Sashiko review [1] and
> respond to that feedback here:
>
> a) Sashiko: "Does this reordering expose the monitor directories to userspace before the
> domain is actually added to the RCU list?"
>
> Yes. As pointed out by Sashiko there is a short time where the monitoring data files
> may be exposed to user space before the monitoring domain is added to the RCU list.
> Also pointed out by Sashiko, if user attempts to read from such file it will return
> -ENOENT.
>
> This behavior looks correct and acceptable to me.
Yes. I agree,
>
> b) Sashiko: "This is a pre-existing issue, but does resctrl_find_domain() safely traverse
> the RCU list?
> Yes, this is safe because resctrl_find_domain() is run with cpus_read_lock() that ensures
> the list can be traversed safely because the list can only be modified with
> CPU write lock held.
> One improvement to resctrl that would help support this is to replace the "list_for_each()"
> domain list traversals with something like:
> list_for_each_entry_rcu(pos, head, member, lockdep_is_cpus_held())
> Doing something like above would help document why such list traversal outside of
> RCU read-side critical section is safe.
Are you planning to make this change at this time? It appears to be a
corner case, and we haven’t observed it in real-world scenarios.
Thanks
Babu
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
` (8 preceding siblings ...)
2026-05-22 19:15 ` [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed Reinette Chatre
@ 2026-05-28 20:08 ` Luck, Tony
2026-05-29 18:37 ` Reinette Chatre
9 siblings, 1 reply; 40+ messages in thread
From: Luck, Tony @ 2026-05-28 20:08 UTC (permalink / raw)
To: Reinette Chatre
Cc: james.morse, Dave.Martin, babu.moger, bp, tglx, dave.hansen, x86,
hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
On Fri, May 22, 2026 at 12:15:04PM -0700, Reinette Chatre wrote:
> v2: https://lore.kernel.org/lkml/20260515193944.15114-1-tony.luck@intel.com/
> v1: https://lore.kernel.org/all/20260508182143.14592-1-tony.luck@intel.com/
>
> While reviewing the AET series [1] Sashiko reported a deadlock during mount,
> and a use-after-free when an L3 domain is removed during CPU offline. Reinette
> found a memory leak in the mount error path while refactoring code for a
> solution to the mount hang.
>
> During review of V1 of this series Sashiko found a new UAF on unmount issue
> that was fixed in V2.
>
> During review of V2 Sashiko uncovered a couple more new issues: TOCTOU
> involving rdtgroup_kn_put() that may lead to UAF or double-free, double
> free of pseudo-locked regions, potential deadlock between resctrl unmount and
> info file readers. Sashiko also found that the CPU offline fix in V2 is flawed
> in its use of is_percpu_thread().
>
> Address all issues identified. This version is significantly different from V2
> because of the additional fixes and reworking of the CPU offline fix. I do not
> consider this version quite "polished" but after all changes made to address
> all the issues identified by Sashiko I would like to check-in with folks (and
> Sashiko) on where the fixes are headed and would appreciate any feedback.
Several of these patches are either authored or co-authored by me, so
I'm uncertain about the ethics of piling on a Reviewed-by tag.
Apart from patch 6 to fix sashiko reported issues in pseudo-locking
everything looks good. I agree with Reinette's assessment[1] that
pursuing the bizarre corner case races for pseudo-locking should
not be a priority. That patch can be dropped.
-Tony
[1] https://lore.kernel.org/all/e40a924f-5398-43bd-821a-2ff9873c5a4c@intel.com/
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list
2026-05-28 19:04 ` Babu Moger
@ 2026-05-28 20:56 ` Reinette Chatre
2026-05-28 23:10 ` Moger, Babu
0 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-28 20:56 UTC (permalink / raw)
To: Babu Moger, tony.luck, james.morse, Dave.Martin, bp, tglx,
dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
Hi Babu,
On 5/28/26 12:04 PM, Babu Moger wrote:
> On 5/28/26 11:11, Reinette Chatre wrote:
>>
>>
>> On 5/22/26 12:15 PM, Reinette Chatre wrote:
>>> static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
>>> @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>>> return;
>>> }
>>> - list_add_tail_rcu(&d->hdr.list, add_pos);
>>> -
>>> err = resctrl_online_mon_domain(r, &d->hdr);
>>> if (err) {
>>> - list_del_rcu(&d->hdr.list);
>>> - synchronize_rcu();
>>> l3_mon_domain_free(hw_dom);
>>> + return;
>>> }
>>> + list_add_tail_rcu(&d->hdr.list, add_pos);
>>> }
>>> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>>
>> I resubmitted the last three patches of series to obtain Sashiko review [1] and
>> respond to that feedback here:
>>
>> a) Sashiko: "Does this reordering expose the monitor directories to userspace before the
>> domain is actually added to the RCU list?"
>>
>> Yes. As pointed out by Sashiko there is a short time where the monitoring data files
>> may be exposed to user space before the monitoring domain is added to the RCU list.
>> Also pointed out by Sashiko, if user attempts to read from such file it will return
>> -ENOENT.
>>
>> This behavior looks correct and acceptable to me.
>
> Yes. I agree,
Thank you for taking a look.
>
>>
>> b) Sashiko: "This is a pre-existing issue, but does resctrl_find_domain() safely traverse
>> the RCU list?
>> Yes, this is safe because resctrl_find_domain() is run with cpus_read_lock() that ensures
>> the list can be traversed safely because the list can only be modified with
>> CPU write lock held.
>> One improvement to resctrl that would help support this is to replace the "list_for_each()"
>> domain list traversals with something like:
>> list_for_each_entry_rcu(pos, head, member, lockdep_is_cpus_held())
>> Doing something like above would help document why such list traversal outside of
>> RCU read-side critical section is safe.
>
> Are you planning to make this change at this time? It appears to be a corner case, and we haven’t observed it in real-world scenarios.
>
I am considering it, yes, and was planning to make the change first to determine the impact on this series
before deciding. This is not a functional change but instead making the code conform to best practice that
implicitly documents why it is safe to traverse the list outside of an RCU read-side critical section. I
think this would be a nice addition to resctrl. Including it in this series may actually help reviewers consider
all the flows involved in these races. Do you think this should be deferred? resctrl cleanup work is not
popular [2] ...
Reinette
[2] https://lore.kernel.org/lkml/cover.1777419024.git.reinette.chatre@intel.com/
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list
2026-05-28 20:56 ` Reinette Chatre
@ 2026-05-28 23:10 ` Moger, Babu
0 siblings, 0 replies; 40+ messages in thread
From: Moger, Babu @ 2026-05-28 23:10 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, tony.luck, james.morse, Dave.Martin,
bp, tglx, dave.hansen
Cc: x86, hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
Hi Reinette,
On 5/28/2026 3:56 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 5/28/26 12:04 PM, Babu Moger wrote:
>> On 5/28/26 11:11, Reinette Chatre wrote:
>>>
>>>
>>> On 5/22/26 12:15 PM, Reinette Chatre wrote:
>>>> static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
>>>> @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>>>> return;
>>>> }
>>>> - list_add_tail_rcu(&d->hdr.list, add_pos);
>>>> -
>>>> err = resctrl_online_mon_domain(r, &d->hdr);
>>>> if (err) {
>>>> - list_del_rcu(&d->hdr.list);
>>>> - synchronize_rcu();
>>>> l3_mon_domain_free(hw_dom);
>>>> + return;
>>>> }
>>>> + list_add_tail_rcu(&d->hdr.list, add_pos);
>>>> }
>>>> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>>>
>>> I resubmitted the last three patches of series to obtain Sashiko review [1] and
>>> respond to that feedback here:
>>>
>>> a) Sashiko: "Does this reordering expose the monitor directories to userspace before the
>>> domain is actually added to the RCU list?"
>>>
>>> Yes. As pointed out by Sashiko there is a short time where the monitoring data files
>>> may be exposed to user space before the monitoring domain is added to the RCU list.
>>> Also pointed out by Sashiko, if user attempts to read from such file it will return
>>> -ENOENT.
>>>
>>> This behavior looks correct and acceptable to me.
>>
>> Yes. I agree,
>
> Thank you for taking a look.
>
>>
>>>
>>> b) Sashiko: "This is a pre-existing issue, but does resctrl_find_domain() safely traverse
>>> the RCU list?
>>> Yes, this is safe because resctrl_find_domain() is run with cpus_read_lock() that ensures
>>> the list can be traversed safely because the list can only be modified with
>>> CPU write lock held.
>>> One improvement to resctrl that would help support this is to replace the "list_for_each()"
>>> domain list traversals with something like:
>>> list_for_each_entry_rcu(pos, head, member, lockdep_is_cpus_held())
>>> Doing something like above would help document why such list traversal outside of
>>> RCU read-side critical section is safe.
>>
>> Are you planning to make this change at this time? It appears to be a corner case, and we haven’t observed it in real-world scenarios.
>>
>
> I am considering it, yes, and was planning to make the change first to determine the impact on this series
> before deciding. This is not a functional change but instead making the code conform to best practice that
> implicitly documents why it is safe to traverse the list outside of an RCU read-side critical section. I
> think this would be a nice addition to resctrl. Including it in this series may actually help reviewers consider
> all the flows involved in these races. Do you think this should be deferred? resctrl cleanup work is not
> popular [2] ...
If it’s as simple as just using the new call
(list_for_each_entry_rcu()), then that should be fine.
I think this is a good candidate for deferral.
Sashiko appears to be reporting every possible scenario, so we may need
to decide what should be prioritized in such cases. Finally, it is
your(maintainers) call.
Thanks,
Babu
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount
2026-05-22 19:15 ` [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount Reinette Chatre
2026-05-28 10:11 ` Ben Horgan
@ 2026-05-29 14:06 ` Chen, Yu C
2026-05-29 15:53 ` Reinette Chatre
1 sibling, 1 reply; 40+ messages in thread
From: Chen, Yu C @ 2026-05-29 14:06 UTC (permalink / raw)
To: Reinette Chatre
Cc: x86, hpa, ben.horgan, tony.luck, fustini, fenghuay, peternewman,
linux-kernel, patches, james.morse, Dave.Martin, babu.moger, bp,
tglx, dave.hansen
On 5/23/2026 3:15 AM, Reinette Chatre wrote:
> @@ -3085,10 +3105,37 @@ static int rdt_get_tree(struct fs_context *fc)
> RESCTRL_PICK_ANY_CPU);
> }
>
> - goto out;
> + /*
> + * Ensure root kn remains accessible after mutex is unlocked so that
Maybe a little more accurate to say "Ensure rdt_root remains accessible"?
Here we increase reference for rdtgroup_default.kn, and protect
against UAF of
kernfs_kill_sb(sb) ->
info = kernfs_info(sb) ->
kernfs_put(info->root->kn)
where the info->root is UAF rather than the kn.
Other looks good to me.
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
> + * kernfs_kill_sb() can run safely if called by kernfs_get_tree()'s
> + * failure path after creating a superblock but before taking reference
> + * on root kn.
> + */
> + kernfs_get(rdtgroup_default.kn);
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount
2026-05-29 14:06 ` Chen, Yu C
@ 2026-05-29 15:53 ` Reinette Chatre
2026-05-31 8:41 ` Chen, Yu C
0 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-29 15:53 UTC (permalink / raw)
To: Chen, Yu C
Cc: x86, hpa, ben.horgan, tony.luck, fustini, fenghuay, peternewman,
linux-kernel, patches, james.morse, Dave.Martin, babu.moger, bp,
tglx, dave.hansen
Hi Chenyu,
On 5/29/26 7:06 AM, Chen, Yu C wrote:
> On 5/23/2026 3:15 AM, Reinette Chatre wrote:
>> @@ -3085,10 +3105,37 @@ static int rdt_get_tree(struct fs_context *fc)
>> RESCTRL_PICK_ANY_CPU);
>> }
>> - goto out;
>> + /*
>> + * Ensure root kn remains accessible after mutex is unlocked so that
>
> Maybe a little more accurate to say "Ensure rdt_root remains accessible"?
> Here we increase reference for rdtgroup_default.kn, and protect
> against UAF of
> kernfs_kill_sb(sb) ->
> info = kernfs_info(sb) ->
> kernfs_put(info->root->kn)
>
> where the info->root is UAF rather than the kn.
Right. The UAF is indeed on the root self while its lifetime is controlled by references
to its kn (root->kn). Dropping the last reference on root->kn causes root to be freed.
rdt_root is the name of a variable though and its value can actually change in the flow
involved here so I'd prefer not to phrase it exactly like that. How about just
"Ensure root remains accessible ..."?
>
> Other looks good to me.
>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Thank you very much.
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues
2026-05-28 20:08 ` [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Luck, Tony
@ 2026-05-29 18:37 ` Reinette Chatre
2026-05-29 19:06 ` Luck, Tony
0 siblings, 1 reply; 40+ messages in thread
From: Reinette Chatre @ 2026-05-29 18:37 UTC (permalink / raw)
To: Luck, Tony
Cc: james.morse, Dave.Martin, babu.moger, bp, tglx, dave.hansen, x86,
hpa, ben.horgan, fustini, fenghuay, peternewman, yu.c.chen,
linux-kernel, patches
Hi Tony,
On 5/28/26 1:08 PM, Luck, Tony wrote:
> On Fri, May 22, 2026 at 12:15:04PM -0700, Reinette Chatre wrote:
>> v2: https://lore.kernel.org/lkml/20260515193944.15114-1-tony.luck@intel.com/
>> v1: https://lore.kernel.org/all/20260508182143.14592-1-tony.luck@intel.com/
>>
>> While reviewing the AET series [1] Sashiko reported a deadlock during mount,
>> and a use-after-free when an L3 domain is removed during CPU offline. Reinette
>> found a memory leak in the mount error path while refactoring code for a
>> solution to the mount hang.
>>
>> During review of V1 of this series Sashiko found a new UAF on unmount issue
>> that was fixed in V2.
>>
>> During review of V2 Sashiko uncovered a couple more new issues: TOCTOU
>> involving rdtgroup_kn_put() that may lead to UAF or double-free, double
>> free of pseudo-locked regions, potential deadlock between resctrl unmount and
>> info file readers. Sashiko also found that the CPU offline fix in V2 is flawed
>> in its use of is_percpu_thread().
>>
>> Address all issues identified. This version is significantly different from V2
>> because of the additional fixes and reworking of the CPU offline fix. I do not
>> consider this version quite "polished" but after all changes made to address
>> all the issues identified by Sashiko I would like to check-in with folks (and
>> Sashiko) on where the fixes are headed and would appreciate any feedback.
>
> Several of these patches are either authored or co-authored by me, so
> I'm uncertain about the ethics of piling on a Reviewed-by tag.
>
> Apart from patch 6 to fix sashiko reported issues in pseudo-locking
> everything looks good. I agree with Reinette's assessment[1] that
> pursuing the bizarre corner case races for pseudo-locking should
> not be a priority. That patch can be dropped.
Thank you for taking a look.
After thinking about patch 6 more I plan to drop the fixes surrounding the races
but keep the fix for the RMID double-add. Since this will touch pseudo-locking region
lifetime management I expect Sashiko would bring up the pseudo-locking races
during its review but I believe this is a worthy fix.
As a summary of your tags for this series I see:
- patches 1, 2, 3, 4, and 9 have your Signed-off-by.
- patches 5, 7, and 8 do not have any tags from you.
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
* RE: [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues
2026-05-29 18:37 ` Reinette Chatre
@ 2026-05-29 19:06 ` Luck, Tony
2026-05-29 20:19 ` Reinette Chatre
0 siblings, 1 reply; 40+ messages in thread
From: Luck, Tony @ 2026-05-29 19:06 UTC (permalink / raw)
To: Chatre, Reinette
Cc: james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com,
bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com,
fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com,
Chen, Yu C, linux-kernel@vger.kernel.org, patches@lists.linux.dev
> > Apart from patch 6 to fix sashiko reported issues in pseudo-locking
> > everything looks good. I agree with Reinette's assessment[1] that
> > pursuing the bizarre corner case races for pseudo-locking should
> > not be a priority. That patch can be dropped.
>
> Thank you for taking a look.
>
> After thinking about patch 6 more I plan to drop the fixes surrounding the races
> but keep the fix for the RMID double-add. Since this will touch pseudo-locking region
> lifetime management I expect Sashiko would bring up the pseudo-locking races
> during its review but I believe this is a worthy fix.
Seems good.
> As a summary of your tags for this series I see:
> - patches 1, 2, 3, 4, and 9 have your Signed-off-by.
> - patches 5, 7, and 8 do not have any tags from you.
You can add my Reviewed-by: Tony Luck <tony.luck@intel.com> to parts 5,7,8
-Tony
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues
2026-05-29 19:06 ` Luck, Tony
@ 2026-05-29 20:19 ` Reinette Chatre
0 siblings, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-05-29 20:19 UTC (permalink / raw)
To: Luck, Tony
Cc: james.morse@arm.com, Dave.Martin@arm.com, babu.moger@amd.com,
bp@alien8.de, tglx@linutronix.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com, ben.horgan@arm.com,
fustini@kernel.org, fenghuay@nvidia.com, peternewman@google.com,
Chen, Yu C, linux-kernel@vger.kernel.org, patches@lists.linux.dev
On 5/29/26 12:06 PM, Luck, Tony wrote:
>>> Apart from patch 6 to fix sashiko reported issues in pseudo-locking
>>> everything looks good. I agree with Reinette's assessment[1] that
>>> pursuing the bizarre corner case races for pseudo-locking should
>>> not be a priority. That patch can be dropped.
>>
>> Thank you for taking a look.
>>
>> After thinking about patch 6 more I plan to drop the fixes surrounding the races
>> but keep the fix for the RMID double-add. Since this will touch pseudo-locking region
>> lifetime management I expect Sashiko would bring up the pseudo-locking races
>> during its review but I believe this is a worthy fix.
>
> Seems good.
>
>> As a summary of your tags for this series I see:
>> - patches 1, 2, 3, 4, and 9 have your Signed-off-by.
>> - patches 5, 7, and 8 do not have any tags from you.
>
> You can add my Reviewed-by: Tony Luck <tony.luck@intel.com> to parts 5,7,8
>
Thank you very much Tony.
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list
2026-05-28 16:11 ` Reinette Chatre
2026-05-28 19:04 ` Babu Moger
@ 2026-05-31 8:37 ` Chen, Yu C
2026-06-01 15:40 ` Reinette Chatre
1 sibling, 1 reply; 40+ messages in thread
From: Chen, Yu C @ 2026-05-31 8:37 UTC (permalink / raw)
To: Reinette Chatre
Cc: x86, hpa, ben.horgan, tony.luck, fustini, fenghuay, peternewman,
linux-kernel, patches, james.morse, Dave.Martin, babu.moger,
dave.hansen, bp, tglx
On 5/29/2026 12:11 AM, Reinette Chatre wrote:
>
>
> On 5/22/26 12:15 PM, Reinette Chatre wrote:
>> static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
>> @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>> return;
>> }
>>
>> - list_add_tail_rcu(&d->hdr.list, add_pos);
>> -
>> err = resctrl_online_mon_domain(r, &d->hdr);
>> if (err) {
>> - list_del_rcu(&d->hdr.list);
>> - synchronize_rcu();
>> l3_mon_domain_free(hw_dom);
>> + return;
>> }
>> + list_add_tail_rcu(&d->hdr.list, add_pos);
>> }
>>
>> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>
> I resubmitted the last three patches of series to obtain Sashiko review [1] and
> respond to that feedback here:
>
> a) Sashiko: "Does this reordering expose the monitor directories to userspace before the
> domain is actually added to the RCU list?"
>
> Yes. As pointed out by Sashiko there is a short time where the monitoring data files
> may be exposed to user space before the monitoring domain is added to the RCU list.
> Also pointed out by Sashiko, if user attempts to read from such file it will return
> -ENOENT.
>
> This behavior looks correct and acceptable to me.
It might not return -ENOENT because the read will be blocked by
cpus_read_lock()
and after the reader has grabed the cpus_read_lock(), the reader will
see the newly
added domain on the list:
CPU 0 (writer) CPU 1 (reader)
========================== ==========================
cpus_write_lock held cpus_read_lock() **BLOCKED**
mkdir mondata files
list_add_tail_rcu()
cpus_write_unlock cpus_read_lock() returns
lock(rdtgroup_mutex)
resctrl_find_domain() **see
a new complete domain**
unlock(rdtgroup_mutex)
cpus_read_unlock()
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount
2026-05-29 15:53 ` Reinette Chatre
@ 2026-05-31 8:41 ` Chen, Yu C
0 siblings, 0 replies; 40+ messages in thread
From: Chen, Yu C @ 2026-05-31 8:41 UTC (permalink / raw)
To: Reinette Chatre
Cc: x86, hpa, ben.horgan, tony.luck, fustini, fenghuay, peternewman,
linux-kernel, patches, james.morse, Dave.Martin, babu.moger, bp,
tglx, dave.hansen
On 5/29/2026 11:53 PM, Reinette Chatre wrote:
> Hi Chenyu,
>
> On 5/29/26 7:06 AM, Chen, Yu C wrote:
>> On 5/23/2026 3:15 AM, Reinette Chatre wrote:
>>> @@ -3085,10 +3105,37 @@ static int rdt_get_tree(struct fs_context *fc)
>>> RESCTRL_PICK_ANY_CPU);
>>> }
>>> - goto out;
>>> + /*
>>> + * Ensure root kn remains accessible after mutex is unlocked so that
>>
>> Maybe a little more accurate to say "Ensure rdt_root remains accessible"?
>> Here we increase reference for rdtgroup_default.kn, and protect
>> against UAF of
>> kernfs_kill_sb(sb) ->
>> info = kernfs_info(sb) ->
>> kernfs_put(info->root->kn)
>>
>> where the info->root is UAF rather than the kn.
>
> Right. The UAF is indeed on the root self while its lifetime is controlled by references
> to its kn (root->kn). Dropping the last reference on root->kn causes root to be freed.
>
> rdt_root is the name of a variable though and its value can actually change in the flow
> involved here so I'd prefer not to phrase it exactly like that. How about just
> "Ensure root remains accessible ..."?
Yes, this looks good to me.
thanks,
Chenyu
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list
2026-05-31 8:37 ` Chen, Yu C
@ 2026-06-01 15:40 ` Reinette Chatre
0 siblings, 0 replies; 40+ messages in thread
From: Reinette Chatre @ 2026-06-01 15:40 UTC (permalink / raw)
To: Chen, Yu C
Cc: x86, hpa, ben.horgan, tony.luck, fustini, fenghuay, peternewman,
linux-kernel, patches, james.morse, Dave.Martin, babu.moger,
dave.hansen, bp, tglx
Hi Chenyu,
On 5/31/26 1:37 AM, Chen, Yu C wrote:
> On 5/29/2026 12:11 AM, Reinette Chatre wrote:
>>
>>
>> On 5/22/26 12:15 PM, Reinette Chatre wrote:
>>> static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
>>> @@ -556,14 +554,12 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
>>> return;
>>> }
>>> - list_add_tail_rcu(&d->hdr.list, add_pos);
>>> -
>>> err = resctrl_online_mon_domain(r, &d->hdr);
>>> if (err) {
>>> - list_del_rcu(&d->hdr.list);
>>> - synchronize_rcu();
>>> l3_mon_domain_free(hw_dom);
>>> + return;
>>> }
>>> + list_add_tail_rcu(&d->hdr.list, add_pos);
>>> }
>>> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>>
>> I resubmitted the last three patches of series to obtain Sashiko review [1] and
>> respond to that feedback here:
>>
>> a) Sashiko: "Does this reordering expose the monitor directories to userspace before the
>> domain is actually added to the RCU list?"
>>
>> Yes. As pointed out by Sashiko there is a short time where the monitoring data files
>> may be exposed to user space before the monitoring domain is added to the RCU list.
>> Also pointed out by Sashiko, if user attempts to read from such file it will return
>> -ENOENT.
>>
>> This behavior looks correct and acceptable to me.
>
> It might not return -ENOENT because the read will be blocked by cpus_read_lock()
> and after the reader has grabed the cpus_read_lock(), the reader will see the newly
> added domain on the list:
>
> CPU 0 (writer) CPU 1 (reader)
> ========================== ==========================
> cpus_write_lock held cpus_read_lock() **BLOCKED**
> mkdir mondata files
> list_add_tail_rcu()
> cpus_write_unlock cpus_read_lock() returns
> lock(rdtgroup_mutex)
> resctrl_find_domain() **see a new complete domain**
> unlock(rdtgroup_mutex)
> cpus_read_unlock()
Indeed. Thank you very much for the correction.
>
>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Thank you very much.
Reinette
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2026-06-01 15:40 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-22 19:15 [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 1/9] fs/resctrl: Move functions to avoid forward references in subsequent fixes Reinette Chatre
2026-05-28 10:06 ` Ben Horgan
2026-05-22 19:15 ` [PATCH v3 2/9] fs/resctrl: Free mon_data structures on rdt_get_tree() failure Reinette Chatre
2026-05-27 15:18 ` Ben Horgan
2026-05-22 19:15 ` [PATCH v3 3/9] fs/resctrl: Fix use-after-free during unmount Reinette Chatre
2026-05-28 9:45 ` Ben Horgan
2026-05-28 16:09 ` Reinette Chatre
2026-05-28 13:48 ` Chen Yu
2026-05-28 16:09 ` Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 4/9] fs/resctrl: Fix deadlock for errors during mount Reinette Chatre
2026-05-28 10:11 ` Ben Horgan
2026-05-29 14:06 ` Chen, Yu C
2026-05-29 15:53 ` Reinette Chatre
2026-05-31 8:41 ` Chen, Yu C
2026-05-22 19:15 ` [PATCH v3 5/9] fs/resctrl: Prevent use-after-free in rdtgroup_kn_put() Reinette Chatre
2026-05-28 10:51 ` Ben Horgan
2026-05-22 19:15 ` [PATCH v3 6/9] fs/resctrl: Fix pseudo-locking lifetime handling Reinette Chatre
2026-05-28 10:56 ` Ben Horgan
2026-05-28 16:10 ` Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 7/9] fs/resctrl: Prevent deadlock and use-after-free in info file handlers Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 8/9] x86/resctrl: Ensure domain fully initialized before placed on RCU list Reinette Chatre
2026-05-28 16:11 ` Reinette Chatre
2026-05-28 19:04 ` Babu Moger
2026-05-28 20:56 ` Reinette Chatre
2026-05-28 23:10 ` Moger, Babu
2026-05-31 8:37 ` Chen, Yu C
2026-06-01 15:40 ` Reinette Chatre
2026-05-22 19:15 ` [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed Reinette Chatre
2026-05-26 15:32 ` Luck, Tony
2026-05-26 17:53 ` Reinette Chatre
2026-05-26 18:27 ` Luck, Tony
2026-05-26 21:05 ` Reinette Chatre
2026-05-26 21:26 ` Luck, Tony
2026-05-27 1:49 ` Reinette Chatre
2026-05-28 16:12 ` Reinette Chatre
2026-05-28 20:08 ` [PATCH v3 0/9] x86,fs/resctrl: Fix long-standing issues Luck, Tony
2026-05-29 18:37 ` Reinette Chatre
2026-05-29 19:06 ` Luck, Tony
2026-05-29 20:19 ` Reinette Chatre
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.