* [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
From: Tony Luck @ 2026-05-01 21:36 UTC (permalink / raw)
To: Borislav Petkov, x86
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu,
linux-kernel, patches, Tony Luck
Sashiko noticed[1] a use-after-free in the resctrl worker thread code.
resctrl_offline_mon_domain() acquires rdtgroup_mutex and calls
cancel_delayed_work() (non-synchronous) on the per-domain mbm_over and
cqm_limbo delayed_work items, then calls domain_destroy_l3_mon_state()
which frees d->rmid_busy_llc and d->mbm_states[]. After it returns, the
caller (e.g. domain_remove_cpu_mon() in arch/x86 or the mpam equivalent)
deletes the domain from its list and frees the domain itself.
cancel_delayed_work() does not wait for a handler that is already
running. mbm_handle_overflow() and cqm_handle_limbo() each acquire
rdtgroup_mutex before touching the domain, so a handler that started
just before resctrl_offline_mon_domain() runs will block on the mutex.
When resctrl_offline_mon_domain() drops the mutex, the handler wakes
up with a stale 'd' obtained via container_of() and dereferences memory
that has just been freed.
Drain the handlers with cancel_delayed_work_sync() so no handler can be
running or pending against the domain when its state is freed:
- Add an 'offlining' flag to struct rdt_l3_mon_domain. Under
rdtgroup_mutex, resctrl_offline_mon_domain() sets it before
dropping the mutex; the handlers test it after acquiring the
mutex and exit without rescheduling. This guarantees that
cancel_delayed_work_sync() does not race with the handler
re-arming itself.
- Drop cpus_read_lock() from mbm_handle_overflow() and
cqm_handle_limbo(). resctrl_offline_mon_domain() can be invoked
from a CPU hotplug callback that holds the hotplug write lock;
a handler blocked on cpus_read_lock() in that window would
deadlock cancel_delayed_work_sync(). The data the handlers
examine is protected by rdtgroup_mutex, and
schedule_delayed_work_on() copes with a target CPU that is going
offline by migrating the work, so the cpus_read_lock() was not
required for correctness.
- Restructure resctrl_offline_mon_domain() to: set ->offlining and
remove the mondata directories under rdtgroup_mutex; drop the
mutex; cancel_delayed_work_sync() both handlers; reacquire the
mutex to do the final force __check_limbo() and free the
per-domain monitor state. The cancel must run with the mutex
released because the handlers acquire it. Cancel both handlers
unconditionally on the L3 path (subject to the feature being
enabled) rather than gating cqm_limbo on has_busy_rmid(): a
handler may already be executing __check_limbo() with no busy
RMIDs left, and that invocation must be drained before its 'd'
is freed.
Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Assisted-by: Copilot:claude-opus-4.7
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
---
include/linux/resctrl.h | 1 +
fs/resctrl/monitor.c | 18 ++++++++++--------
fs/resctrl/rdtgroup.c | 38 ++++++++++++++++++++++++++++++++++----
3 files changed, 45 insertions(+), 12 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 006e57fd7ca5..73f2638b96ad 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -203,6 +203,7 @@ struct rdt_l3_mon_domain {
int mbm_work_cpu;
int cqm_work_cpu;
struct mbm_cntr_cfg *cntr_cfg;
+ bool offlining;
};
/**
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9fd901c78dc6..e68eec83306e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
struct rdt_l3_mon_domain *d;
- cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
+ /* If this domain is being deleted this work no longer needs to run. */
+ if (d->offlining)
+ goto out_unlock;
+
__check_limbo(d, false);
if (has_busy_rmid(d)) {
@@ -808,8 +811,8 @@ void cqm_handle_limbo(struct work_struct *work)
delay);
}
+out_unlock:
mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
}
/**
@@ -841,18 +844,18 @@ void mbm_handle_overflow(struct work_struct *work)
struct list_head *head;
struct rdt_resource *r;
- cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
+ d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
+
/*
- * If the filesystem has been unmounted this work no longer needs to
- * run.
+ * If this domain is being deleted, or the filesystem has been
+ * unmounted this work no longer needs to run.
*/
- if (!resctrl_mounted || !resctrl_arch_mon_capable())
+ if (d->offlining || !resctrl_mounted || !resctrl_arch_mon_capable())
goto out_unlock;
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp);
@@ -875,7 +878,6 @@ void mbm_handle_overflow(struct work_struct *work)
out_unlock:
mutex_unlock(&rdtgroup_mutex);
- cpus_read_unlock();
}
/**
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 8544020ef420..c883149fa373 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4323,7 +4323,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_l3_mon_domain *d;
+ struct rdt_l3_mon_domain *d = NULL;
mutex_lock(&rdtgroup_mutex);
@@ -4341,8 +4341,39 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
goto out_unlock;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
+
+ /*
+ * Tell mbm_handle_overflow() and cqm_handle_limbo() that this
+ * domain is going away.
+ */
+ d->offlining = true;
+
+out_unlock:
+ mutex_unlock(&rdtgroup_mutex);
+
+ if (!d)
+ return;
+
+ /*
+ * Drain any pending or in-flight overflow / limbo handlers before
+ * freeing per-domain monitor state (and before the caller frees the
+ * domain itself). cancel_delayed_work_sync() must be called with
+ * rdtgroup_mutex released because the handlers acquire it; the
+ * handlers no longer take cpus_read_lock(), so this is safe to call
+ * from a CPU hotplug callback that holds the hotplug write lock.
+ *
+ * Without the synchronous cancel, a handler that was already running
+ * and blocked on rdtgroup_mutex when this function was entered could
+ * wake after the mutex is dropped and dereference d->rmid_busy_llc,
+ * d->mbm_states[] or the domain itself after they have been freed.
+ */
if (resctrl_is_mbm_enabled())
- cancel_delayed_work(&d->mbm_over);
+ cancel_delayed_work_sync(&d->mbm_over);
+ if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
+ cancel_delayed_work_sync(&d->cqm_limbo);
+
+ mutex_lock(&rdtgroup_mutex);
+
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
/*
* When a package is going down, forcefully
@@ -4353,11 +4384,10 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* package never comes back.
*/
__check_limbo(d, true);
- cancel_delayed_work(&d->cqm_limbo);
}
domain_destroy_l3_mon_state(d);
-out_unlock:
+
mutex_unlock(&rdtgroup_mutex);
}
--
2.54.0
* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
From: Reinette Chatre @ 2026-05-04 15:11 UTC (permalink / raw)
To: Tony Luck, Borislav Petkov, x86
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, linux-kernel,
patches
Hi Tony,
On 5/1/26 2:36 PM, Tony Luck wrote:
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 9fd901c78dc6..e68eec83306e 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
> unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> struct rdt_l3_mon_domain *d;
>
> - cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
Since work always runs on a CPU belonging to the domain, could it be simpler to use
get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
instead of the work contained in the domain struct?
This seems to more closely match the pattern used in rdtgroup_mondata_show() that
stores the domain ID in its state instead of a pointer to the domain and then uses
resctrl_find_domain() to find domain.
>
> + /* If this domain is being deleted this work no longer needs to run. */
> + if (d->offlining)
> + goto out_unlock;
> +
Reinette
* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
From: Luck, Tony @ 2026-05-04 22:50 UTC (permalink / raw)
To: Reinette Chatre
Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen Yu, linux-kernel, patches
On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/1/26 2:36 PM, Tony Luck wrote:
> > diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> > index 9fd901c78dc6..e68eec83306e 100644
> > --- a/fs/resctrl/monitor.c
> > +++ b/fs/resctrl/monitor.c
> > @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
> > unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> > struct rdt_l3_mon_domain *d;
> >
> > - cpus_read_lock();
> > mutex_lock(&rdtgroup_mutex);
> >
> > d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
>
> Since work always runs on a CPU belonging to the domain, could it be simpler to use
> get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
> instead of the work contained in the domain struct?
Is this true? When a CPU is taken offline Linux picks another CPU to run
any unexpired queued work. No guarantee that new CPU is in the same
domain. I think a robust solution is going to need a check that the
delayed work handlers are on the right domain. It looks like existing
code doesn't handle this well.
>
> This seems to more closely match the pattern used in rdtgroup_mondata_show() that
> stores the domain ID in its state instead of a pointer to the domain and then uses
> resctrl_find_domain() to find domain.
>
> >
> > + /* If this domain is being deleted this work no longer needs to run. */
> > + if (d->offlining)
> > + goto out_unlock;
> > +
Claude seemed quite confident about removal of cpus_read_lock() in these
functions. Sashiko is confident that this has opened up several new race
conditions:
https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com
Maybe we need to add a reference count to the rdt_l3_mon_domain
structure and delay freeing it until the last user is gone?
>
> Reinette
-Tony
* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
From: Reinette Chatre @ 2026-05-05 4:39 UTC (permalink / raw)
To: Luck, Tony
Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen Yu, linux-kernel, patches
Hi Tony,
On 5/4/26 3:50 PM, Luck, Tony wrote:
> On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
>> On 5/1/26 2:36 PM, Tony Luck wrote:
>>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
>>> index 9fd901c78dc6..e68eec83306e 100644
>>> --- a/fs/resctrl/monitor.c
>>> +++ b/fs/resctrl/monitor.c
>>> @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
>>> unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
>>> struct rdt_l3_mon_domain *d;
>>>
>>> - cpus_read_lock();
>>> mutex_lock(&rdtgroup_mutex);
>>>
>>> d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
>>
>> Since work always runs on a CPU belonging to the domain, could it be simpler to use
>> get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
>> instead of the work contained in the domain struct?
>
> Is this true? When a CPU is taken offline Linux picks another CPU to run
These are workers supporting monitoring and they read RMIDs from where they are
running. They do not need to IPI a CPU in another domain to read the monitoring data
of interest. Are you seeing this behave differently?
Theoretically resctrl could have workers run anywhere if the events being handled
don't care which CPU the monitoring data is read from, but the current workers
do not behave this way.
> any unexpired queued work. No guarantee that new CPU is in the same
> domain. I think a robust solution is going to need a check that the
There are a couple of places where the work is scheduled. I understand the
particular scenario you refer to to be the work done by resctrl_offline_cpu().
Specifically,
resctrl_offline_cpu(unsigned int cpu /* CPU being offlined */)
{
struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
struct rdt_l3_mon_domain *d;
...
d = get_mon_domain_from_cpu(cpu, l3); /* d is domain to which CPU being offlined belongs */
if (d) {
if (/* overflow handler currently running on CPU being offlined */) {
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
}
if (/* limbo handler currently running on CPU being offlined */) {
cancel_delayed_work(&d->cqm_limbo);
cqm_setup_limbo_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
}
}
}
From above I see that the new CPU being picked for the work *is* guaranteed to be in the
same domain as the CPU being offlined. What am I missing?
> delayed work handlers are on the right domain. It looks like existing
> code doesn't handle this well.
The work is also scheduled from resctrl_online_mon_domain() and then re-scheduled from
the workers themselves, cqm_handle_limbo() and mbm_handle_overflow(). In all cases the
workers stay in their respective domains.
Could you please elaborate how you find that the existing code does not handle this well?
>> This seems to more closely match the pattern used in rdtgroup_mondata_show() that
>> stores the domain ID in its state instead of a pointer to the domain and then uses
>> resctrl_find_domain() to find domain.
>>
>>>
>>> + /* If this domain is being deleted this work no longer needs to run. */
>>> + if (d->offlining)
>>> + goto out_unlock;
>>> +
>
> Claude seemed quite confident about removal of cpus_read_lock() in these
> functions. Sashiko is confident that this has opened up several new race
> conditions:
Domains are stored in a RCU list so if cpus_read_lock() is not possible they can also be
accessed from a RCU read-side critical section for which x86 do not have an example at
this time. This would look something like:
rcu_read_lock();
list_for_each_entry_rcu() {
}
rcu_read_unlock();
The definition of domain_list_lock within arch/x86/kernel/cpu/resctrl/core.c has a nice
writeup created by James about the locking requirements.
I do not think cpus_read_lock() should be removed. Especially not since my suggestion is
to actually traverse the domain list using get_mon_domain_from_cpu() that will complain
loudly if the CPU hotplug lock is not held.
>
> https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com
>
> Maybe we need to add a reference count to the rdt_l3_mon_domain
> structure and delay freeing it until the last user is gone?
I still think that using get_mon_domain_from_cpu() in the workers could work here.
Here is the idea more specifically for the MBM overflow handler:
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 88f1fa0b9d8d..7186d6d02d6e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
+ d = get_mon_domain_from_cpu(smp_processor_id(), r);
+ if (!d)
+ goto out_unlock;
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp);
update_mba_bw() is run from mbm_handle_overflow() and uses this same pattern
to get the MBA control domain which I find to support that the context is safe.
Reinette
* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
From: Luck, Tony @ 2026-05-05 16:45 UTC (permalink / raw)
To: Reinette Chatre
Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen Yu, linux-kernel, patches
On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/4/26 3:50 PM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
> >> On 5/1/26 2:36 PM, Tony Luck wrote:
>
> >>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >>> index 9fd901c78dc6..e68eec83306e 100644
> >>> --- a/fs/resctrl/monitor.c
> >>> +++ b/fs/resctrl/monitor.c
> >>> @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
> >>> unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> >>> struct rdt_l3_mon_domain *d;
> >>>
> >>> - cpus_read_lock();
> >>> mutex_lock(&rdtgroup_mutex);
> >>>
> >>> d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
> >>
> >> Since work always runs on a CPU belonging to the domain, could it be simpler to use
> >> get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
> >> instead of the work contained in the domain struct?
> >
> > Is this true? When a CPU is taken offline Linux picks another CPU to run
>
> These are workers supporting monitoring and they read RMIDs from where they are
> running. They do not need to IPI a CPU in another domain to read the monitoring data
> of interest. Are you seeing this behave differently?
>
> Theoretically resctrl could have workers run anywhere if the events being handled
> don't care which CPU the monitoring data is read from, but the current workers
> do not behave this way.
>
> > any unexpired queued work. No guarantee that new CPU is in the same
> > domain. I think a robust solution is going to need a check that the
>
> There are a couple of places where the work is scheduled. I understand the
> particular scenario you refer to to be the work done by resctrl_offline_cpu().
> Specifically,
>
> resctrl_offline_cpu(unsigned int cpu /* CPU being offlined */)
> {
> struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> struct rdt_l3_mon_domain *d;
> ...
>
> d = get_mon_domain_from_cpu(cpu, l3); /* d is domain to which CPU being offlined belongs */
> if (d) {
> if (/* overflow handler currently running on CPU being offlined */) {
> cancel_delayed_work(&d->mbm_over);
> mbm_setup_overflow_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
> }
> if (/* limbo handler currently running on CPU being offlined */) {
> cancel_delayed_work(&d->cqm_limbo);
> cqm_setup_limbo_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
> }
> }
> }
>
> From above I see that the new CPU being picked for the work *is* guaranteed to be in the
> same domain as the CPU being offlined. What am I missing?
Here's the scenario:
There is just one CPU left in a domain. The mbm_over worker is woken on
that CPU just as a user requests to take that CPU offline (executing on
another CPU).
There is a race. mbm_handle_overflow() has begun execution, but the
offline process has taken cpus_write_lock() so it blocks and sleeps
waiting for cpus_read_lock().
The offline process calls:
resctrl_arch_offline_cpu()
-> resctrl_offline_cpu
-> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
-> mbm_setup_overflow_handler(d, 0, cpu);
Finds there are other CPUs in the domain
-> domain_remove_cpu()
-> domain_remove_cpu_mon()
clears bit for this CPU from hdr->cpu_mask and finds mask is empty
-> resctrl_offline_mon_domain()
-> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
-> domain_destroy_l3_mon_state()
-> kfree(d->mbm_states[idx]); /* file system state */
-> removes domain from L3 resource domain list
-> l3_mon_domain_free()
kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
kfree(hw_dom)
Offline process continues. One of the things it does is checks for
orphans on the run queue of the now deceased CPU and adopts them.
Finally the offline process completes and does cpus_write_unlock()
Now mbm_handle_overflow() can continue. But it is on the wrong CPU
and has a pointer to a work_struct that was freed by the offline
process.
> > delayed work handlers are on the right domain. It looks like existing
> > code doesn't handle this well.
>
> The work is also scheduled from resctrl_online_mon_domain() and then re-scheduled from
> the workers themselves, cqm_handle_limbo() and mbm_handle_overflow(). In all cases the
> workers stay in their respective domains.
>
> Could you please elaborate how you find that the existing code does not handle this well?
> >> This seems to more closely match the pattern used in rdtgroup_mondata_show() that
> >> stores the domain ID in its state instead of a pointer to the domain and then uses
> >> resctrl_find_domain() to find domain.
> >>
> >>>
> >>> + /* If this domain is being deleted this work no longer needs to run. */
> >>> + if (d->offlining)
> >>> + goto out_unlock;
> >>> +
> >
> > Claude seemed quite confident about removal of cpus_read_lock() in these
> > functions. Sashiko is confident that this has opened up several new race
> > conditions:
>
> Domains are stored in a RCU list so if cpus_read_lock() is not possible they can also be
> accessed from a RCU read-side critical section for which x86 do not have an example at
> this time. This would look something like:
> rcu_read_lock();
> list_for_each_entry_rcu() {
> }
> rcu_read_lock();
>
> The definition of domain_list_lock within arch/x86/kernel/cpu/resctrl/core.c has a nice
> writeup created by James about the locking requirements.
>
> I do not think cpus_read_lock() should be removed. Especially not since my suggestion is
> to actually traverse the domain list using get_mon_domain_from_cpu() that will complain
> loudly if the CPU hotplug lock is not held.
Agreed. This was a bad choice by Claude, and I fell for its reasoning.
> >
> > https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com
> >
> > Maybe we need to add a reference count to the rdt_l3_mon_domain
> > structure and delay freeing it until the last user is gone?
>
> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> Here is the idea more specifically for the MBM overflow handler:
>
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 88f1fa0b9d8d..7186d6d02d6e 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
> goto out_unlock;
>
> r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> - d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> + d = get_mon_domain_from_cpu(smp_processor_id(), r);
> + if (!d)
> + goto out_unlock;
This will get a domain, but it will be for the wrong CPU.
Maybe it doesn't hurt for it to perform an extra set of mbm_update()
calls on that CPU?
It will then do:
d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
RESCTRL_PICK_ANY_CPU);
schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
This "wrong" domain already has a worker ... Will this just reset the
timeout to the new "delay" value? Possibly also to a different CPU?
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> mbm_update(r, d, prgrp);
>
> update_mba_bw() is run from mbm_handle_overflow() and uses this same pattern
> to get the MBA control domain which I find to support that the context is safe.
>
> Reinette
-Tony