public inbox for linux-kernel@vger.kernel.org
* [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
@ 2026-05-01 21:36 Tony Luck
  2026-05-04 15:11 ` Reinette Chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Tony Luck @ 2026-05-01 21:36 UTC (permalink / raw)
  To: Borislav Petkov, x86
  Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu,
	linux-kernel, patches, Tony Luck

Sashiko noticed[1] a use-after-free in the resctrl worker thread code.

resctrl_offline_mon_domain() acquires rdtgroup_mutex and calls
cancel_delayed_work() (non-synchronous) on the per-domain mbm_over and
cqm_limbo delayed_work items, then calls domain_destroy_l3_mon_state()
which frees d->rmid_busy_llc and d->mbm_states[]. After it returns, the
caller (e.g. domain_remove_cpu_mon() in arch/x86 or the mpam equivalent)
deletes the domain from its list and frees the domain itself.

cancel_delayed_work() does not wait for a handler that is already
running. mbm_handle_overflow() and cqm_handle_limbo() each acquire
rdtgroup_mutex before touching the domain, so a handler that started
just before resctrl_offline_mon_domain() runs will block on the mutex.
When resctrl_offline_mon_domain() drops the mutex, the handler wakes
up with a stale 'd' obtained via container_of() and dereferences memory
that has just been freed.

Drain the handlers with cancel_delayed_work_sync() so no handler can be
running or pending against the domain when its state is freed:

  - Add an 'offlining' flag to struct rdt_l3_mon_domain. Under
    rdtgroup_mutex, resctrl_offline_mon_domain() sets it before
    dropping the mutex; the handlers test it after acquiring the
    mutex and exit without rescheduling. This guarantees that
    cancel_delayed_work_sync() does not race with the handler
    re-arming itself.

  - Drop cpus_read_lock() from mbm_handle_overflow() and
    cqm_handle_limbo(). resctrl_offline_mon_domain() can be invoked
    from a CPU hotplug callback that holds the hotplug write lock;
    a handler blocked on cpus_read_lock() in that window would
    deadlock cancel_delayed_work_sync(). The data the handlers
    examine is protected by rdtgroup_mutex, and
    schedule_delayed_work_on() copes with a target CPU that is going
    offline by migrating the work, so the cpus_read_lock() was not
    required for correctness.

  - Restructure resctrl_offline_mon_domain() to: set ->offlining and
    remove the mondata directories under rdtgroup_mutex; drop the
    mutex; cancel_delayed_work_sync() both handlers; reacquire the
    mutex to do the final force __check_limbo() and free the
    per-domain monitor state. The cancel must run with the mutex
    released because the handlers acquire it. Cancel both handlers
    unconditionally on the L3 path (subject to the feature being
    enabled) rather than gating cqm_limbo on has_busy_rmid(): a
    handler may already be executing __check_limbo() with no busy
    RMIDs left, and that invocation must be drained before its 'd'
    is freed.

Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
Assisted-by: Copilot:claude-opus-4.7
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
---
 include/linux/resctrl.h |  1 +
 fs/resctrl/monitor.c    | 18 ++++++++++--------
 fs/resctrl/rdtgroup.c   | 38 ++++++++++++++++++++++++++++++++++----
 3 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 006e57fd7ca5..73f2638b96ad 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -203,6 +203,7 @@ struct rdt_l3_mon_domain {
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
 	struct mbm_cntr_cfg		*cntr_cfg;
+	bool				offlining;
 };
 
 /**
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9fd901c78dc6..e68eec83306e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
 	struct rdt_l3_mon_domain *d;
 
-	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
 	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
 
+	/*  If this domain is being deleted this work no longer needs to run. */
+	if (d->offlining)
+		goto out_unlock;
+
 	__check_limbo(d, false);
 
 	if (has_busy_rmid(d)) {
@@ -808,8 +811,8 @@ void cqm_handle_limbo(struct work_struct *work)
 					 delay);
 	}
 
+out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 }
 
 /**
@@ -841,18 +844,18 @@ void mbm_handle_overflow(struct work_struct *work)
 	struct list_head *head;
 	struct rdt_resource *r;
 
-	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
+	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
+
 	/*
-	 * If the filesystem has been unmounted this work no longer needs to
-	 * run.
+	 * If this domain is being deleted, or the filesystem has been
+	 * unmounted this work no longer needs to run.
 	 */
-	if (!resctrl_mounted || !resctrl_arch_mon_capable())
+	if (d->offlining || !resctrl_mounted || !resctrl_arch_mon_capable())
 		goto out_unlock;
 
 	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
-	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp);
@@ -875,7 +878,6 @@ void mbm_handle_overflow(struct work_struct *work)
 
 out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
-	cpus_read_unlock();
 }
 
 /**
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 8544020ef420..c883149fa373 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4323,7 +4323,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
 
 void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
 {
-	struct rdt_l3_mon_domain *d;
+	struct rdt_l3_mon_domain *d = NULL;
 
 	mutex_lock(&rdtgroup_mutex);
 
@@ -4341,8 +4341,39 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		goto out_unlock;
 
 	d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
+
+	/*
+	 * Tell mbm_handle_overflow() and cqm_handle_limbo() that this
+	 * domain is going away.
+	 */
+	d->offlining = true;
+
+out_unlock:
+	mutex_unlock(&rdtgroup_mutex);
+
+	if (!d)
+		return;
+
+	/*
+	 * Drain any pending or in-flight overflow / limbo handlers before
+	 * freeing per-domain monitor state (and before the caller frees the
+	 * domain itself). cancel_delayed_work_sync() must be called with
+	 * rdtgroup_mutex released because the handlers acquire it; the
+	 * handlers no longer take cpus_read_lock(), so this is safe to call
+	 * from a CPU hotplug callback that holds the hotplug write lock.
+	 *
+	 * Without the synchronous cancel, a handler that was already running
+	 * and blocked on rdtgroup_mutex when this function was entered could
+	 * wake after the mutex is dropped and dereference d->rmid_busy_llc,
+	 * d->mbm_states[] or the domain itself after they have been freed.
+	 */
 	if (resctrl_is_mbm_enabled())
-		cancel_delayed_work(&d->mbm_over);
+		cancel_delayed_work_sync(&d->mbm_over);
+	if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
+		cancel_delayed_work_sync(&d->cqm_limbo);
+
+	mutex_lock(&rdtgroup_mutex);
+
 	if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
 		/*
 		 * When a package is going down, forcefully
@@ -4353,11 +4384,10 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
 		 * package never comes back.
 		 */
 		__check_limbo(d, true);
-		cancel_delayed_work(&d->cqm_limbo);
 	}
 
 	domain_destroy_l3_mon_state(d);
-out_unlock:
+
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-01 21:36 [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain() Tony Luck
@ 2026-05-04 15:11 ` Reinette Chatre
  2026-05-04 22:50   ` Luck, Tony
  0 siblings, 1 reply; 16+ messages in thread
From: Reinette Chatre @ 2026-05-04 15:11 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov, x86
  Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, Chen Yu, linux-kernel,
	patches

Hi Tony,

On 5/1/26 2:36 PM, Tony Luck wrote:
> Sashiko noticed[1] a use-after-free in the resctrl worker thread code.
> 
> resctrl_offline_mon_domain() acquires rdtgroup_mutex and calls
> cancel_delayed_work() (non-synchronous) on the per-domain mbm_over and
> cqm_limbo delayed_work items, then calls domain_destroy_l3_mon_state()
> which frees d->rmid_busy_llc and d->mbm_states[]. After it returns, the
> caller (e.g. domain_remove_cpu_mon() in arch/x86 or the mpam equivalent)
> deletes the domain from its list and frees the domain itself.
> 
> cancel_delayed_work() does not wait for a handler that is already
> running. mbm_handle_overflow() and cqm_handle_limbo() each acquire
> rdtgroup_mutex before touching the domain, so a handler that started
> just before resctrl_offline_mon_domain() runs will block on the mutex.
> When resctrl_offline_mon_domain() drops the mutex, the handler wakes
> up with a stale 'd' obtained via container_of() and dereferences memory
> that has just been freed.
> 
> Drain the handlers with cancel_delayed_work_sync() so no handler can be
> running or pending against the domain when its state is freed:
> 
>   - Add an 'offlining' flag to struct rdt_l3_mon_domain. Under
>     rdtgroup_mutex, resctrl_offline_mon_domain() sets it before
>     dropping the mutex; the handlers test it after acquiring the
>     mutex and exit without rescheduling. This guarantees that
>     cancel_delayed_work_sync() does not race with the handler
>     re-arming itself.
> 
>   - Drop cpus_read_lock() from mbm_handle_overflow() and
>     cqm_handle_limbo(). resctrl_offline_mon_domain() can be invoked
>     from a CPU hotplug callback that holds the hotplug write lock;
>     a handler blocked on cpus_read_lock() in that window would
>     deadlock cancel_delayed_work_sync(). The data the handlers
>     examine is protected by rdtgroup_mutex, and
>     schedule_delayed_work_on() copes with a target CPU that is going
>     offline by migrating the work, so the cpus_read_lock() was not
>     required for correctness.
> 
>   - Restructure resctrl_offline_mon_domain() to: set ->offlining and
>     remove the mondata directories under rdtgroup_mutex; drop the
>     mutex; cancel_delayed_work_sync() both handlers; reacquire the
>     mutex to do the final force __check_limbo() and free the
>     per-domain monitor state. The cancel must run with the mutex
>     released because the handlers acquire it. Cancel both handlers
>     unconditionally on the L3 path (subject to the feature being
>     enabled) rather than gating cqm_limbo on has_busy_rmid(): a
>     handler may already be executing __check_limbo() with no busy
>     RMIDs left, and that invocation must be drained before its 'd'
>     is freed.
> 
> Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
> Assisted-by: Copilot:claude-opus-4.7
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
> ---
>  include/linux/resctrl.h |  1 +
>  fs/resctrl/monitor.c    | 18 ++++++++++--------
>  fs/resctrl/rdtgroup.c   | 38 ++++++++++++++++++++++++++++++++++----
>  3 files changed, 45 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 006e57fd7ca5..73f2638b96ad 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -203,6 +203,7 @@ struct rdt_l3_mon_domain {
>  	int				mbm_work_cpu;
>  	int				cqm_work_cpu;
>  	struct mbm_cntr_cfg		*cntr_cfg;
> +	bool				offlining;
>  };
>  
>  /**
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 9fd901c78dc6..e68eec83306e 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
>  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
>  	struct rdt_l3_mon_domain *d;
>  
> -	cpus_read_lock();
>  	mutex_lock(&rdtgroup_mutex);
>  
>  	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);

Since work always runs on a CPU belonging to the domain, could it be simpler to use
get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
instead of the work contained in the domain struct? 

This seems to more closely match the pattern used in rdtgroup_mondata_show() that
stores the domain ID in its state instead of a pointer to the domain and then uses
resctrl_find_domain() to find the domain.

>  
> +	/*  If this domain is being deleted this work no longer needs to run. */
> +	if (d->offlining)
> +		goto out_unlock;
> +


Reinette


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-04 15:11 ` Reinette Chatre
@ 2026-05-04 22:50   ` Luck, Tony
  2026-05-05  4:39     ` Reinette Chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2026-05-04 22:50 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/1/26 2:36 PM, Tony Luck wrote:
> > Sashiko noticed[1] a use-after-free in the resctrl worker thread code.
> > 
> > resctrl_offline_mon_domain() acquires rdtgroup_mutex and calls
> > cancel_delayed_work() (non-synchronous) on the per-domain mbm_over and
> > cqm_limbo delayed_work items, then calls domain_destroy_l3_mon_state()
> > which frees d->rmid_busy_llc and d->mbm_states[]. After it returns, the
> > caller (e.g. domain_remove_cpu_mon() in arch/x86 or the mpam equivalent)
> > deletes the domain from its list and frees the domain itself.
> > 
> > cancel_delayed_work() does not wait for a handler that is already
> > running. mbm_handle_overflow() and cqm_handle_limbo() each acquire
> > rdtgroup_mutex before touching the domain, so a handler that started
> > just before resctrl_offline_mon_domain() runs will block on the mutex.
> > When resctrl_offline_mon_domain() drops the mutex, the handler wakes
> > up with a stale 'd' obtained via container_of() and dereferences memory
> > that has just been freed.
> > 
> > Drain the handlers with cancel_delayed_work_sync() so no handler can be
> > running or pending against the domain when its state is freed:
> > 
> >   - Add an 'offlining' flag to struct rdt_l3_mon_domain. Under
> >     rdtgroup_mutex, resctrl_offline_mon_domain() sets it before
> >     dropping the mutex; the handlers test it after acquiring the
> >     mutex and exit without rescheduling. This guarantees that
> >     cancel_delayed_work_sync() does not race with the handler
> >     re-arming itself.
> > 
> >   - Drop cpus_read_lock() from mbm_handle_overflow() and
> >     cqm_handle_limbo(). resctrl_offline_mon_domain() can be invoked
> >     from a CPU hotplug callback that holds the hotplug write lock;
> >     a handler blocked on cpus_read_lock() in that window would
> >     deadlock cancel_delayed_work_sync(). The data the handlers
> >     examine is protected by rdtgroup_mutex, and
> >     schedule_delayed_work_on() copes with a target CPU that is going
> >     offline by migrating the work, so the cpus_read_lock() was not
> >     required for correctness.
> > 
> >   - Restructure resctrl_offline_mon_domain() to: set ->offlining and
> >     remove the mondata directories under rdtgroup_mutex; drop the
> >     mutex; cancel_delayed_work_sync() both handlers; reacquire the
> >     mutex to do the final force __check_limbo() and free the
> >     per-domain monitor state. The cancel must run with the mutex
> >     released because the handlers acquire it. Cancel both handlers
> >     unconditionally on the L3 path (subject to the feature being
> >     enabled) rather than gating cqm_limbo on has_busy_rmid(): a
> >     handler may already be executing __check_limbo() with no busy
> >     RMIDs left, and that invocation must be drained before its 'd'
> >     is freed.
> > 
> > Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
> > Assisted-by: Copilot:claude-opus-4.7
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
> > ---
> >  include/linux/resctrl.h |  1 +
> >  fs/resctrl/monitor.c    | 18 ++++++++++--------
> >  fs/resctrl/rdtgroup.c   | 38 ++++++++++++++++++++++++++++++++++----
> >  3 files changed, 45 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 006e57fd7ca5..73f2638b96ad 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -203,6 +203,7 @@ struct rdt_l3_mon_domain {
> >  	int				mbm_work_cpu;
> >  	int				cqm_work_cpu;
> >  	struct mbm_cntr_cfg		*cntr_cfg;
> > +	bool				offlining;
> >  };
> >  
> >  /**
> > diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> > index 9fd901c78dc6..e68eec83306e 100644
> > --- a/fs/resctrl/monitor.c
> > +++ b/fs/resctrl/monitor.c
> > @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
> >  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> >  	struct rdt_l3_mon_domain *d;
> >  
> > -	cpus_read_lock();
> >  	mutex_lock(&rdtgroup_mutex);
> >  
> >  	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
> 
> Since work always runs on a CPU belonging to the domain, could it be simpler to use
> get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
> instead of the work contained in the domain struct? 

Is this true? When a CPU is taken offline Linux picks another CPU to run
any unexpired queued work. No guarantee that new CPU is in the same
domain. I think a robust solution is going to need a check that the
delayed work handlers are on the right domain. It looks like existing
code doesn't handle this well.

> 
> This seems to more closely match the pattern used in rdtgroup_mondata_show() that
> stores the domain ID in its state instead of a pointer to the domain and then uses
> resctrl_find_domain() to find domain.
> 
> >  
> > +	/*  If this domain is being deleted this work no longer needs to run. */
> > +	if (d->offlining)
> > +		goto out_unlock;
> > +

Claude seemed quite confident about removal of cpus_read_lock() in these
functions. Sashiko is confident that this has opened up several new race
conditions:

https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com

Maybe we need to add a reference count to the rdt_l3_mon_domain
structure and delay freeing it until the last user is gone?

> 
> Reinette

-Tony


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-04 22:50   ` Luck, Tony
@ 2026-05-05  4:39     ` Reinette Chatre
  2026-05-05 16:45       ` Luck, Tony
  0 siblings, 1 reply; 16+ messages in thread
From: Reinette Chatre @ 2026-05-05  4:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

Hi Tony,

On 5/4/26 3:50 PM, Luck, Tony wrote:
> On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
>> On 5/1/26 2:36 PM, Tony Luck wrote:

>>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
>>> index 9fd901c78dc6..e68eec83306e 100644
>>> --- a/fs/resctrl/monitor.c
>>> +++ b/fs/resctrl/monitor.c
>>> @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
>>>  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
>>>  	struct rdt_l3_mon_domain *d;
>>>  
>>> -	cpus_read_lock();
>>>  	mutex_lock(&rdtgroup_mutex);
>>>  
>>>  	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
>>
>> Since work always runs on a CPU belonging to the domain, could it be simpler to use
>> get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
>> instead of the work contained in the domain struct? 
> 
> Is this true? When a CPU is taken offline Linux picks another CPU to run

These are workers supporting monitoring and they read RMIDs from where they are 
running. They do not need to IPI a CPU in another domain to read the monitoring data
of interest. Are you seeing this behave differently?

Theoretically resctrl could have workers run anywhere if the events being handled
don't care which CPU the monitoring data is read from, but the current workers
do not behave this way.

> any unexpired queued work. No guarantee that new CPU is in the same
> domain. I think a robust solution is going to need a check that the

There are a couple of places where the work is scheduled. I understand the
particular scenario you refer to to be the work done by resctrl_offline_cpu().
Specifically,

	resctrl_offline_cpu(unsigned int cpu /* CPU being offlined */)
	{
		struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
		struct rdt_l3_mon_domain *d;
		...
		
		d = get_mon_domain_from_cpu(cpu, l3); /* d is domain to which CPU being offlined belongs */
		if (d) {
			if (/* overflow handler currently running on CPU being offlined */) {
				cancel_delayed_work(&d->mbm_over);
				mbm_setup_overflow_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
			}
			if (/* limbo handler currently running on CPU being offlined */) {
				cancel_delayed_work(&d->cqm_limbo);
				cqm_setup_limbo_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
			}
		}
	}

From above I see that the new CPU being picked for the work *is* guaranteed to be in the
same domain as the CPU being offlined. What am I missing?

> delayed work handlers are on the right domain. It looks like existing
> code doesn't handle this well.

The work is also scheduled from resctrl_online_mon_domain() and then re-scheduled from
the workers themselves, cqm_handle_limbo() and mbm_handle_overflow(). In all cases the
workers stay in their respective domains.

Could you please elaborate how you find that the existing code does not handle this well?

>> This seems to more closely match the pattern used in rdtgroup_mondata_show() that
>> stores the domain ID in its state instead of a pointer to the domain and then uses
>> resctrl_find_domain() to find domain.
>>
>>>  
>>> +	/*  If this domain is being deleted this work no longer needs to run. */
>>> +	if (d->offlining)
>>> +		goto out_unlock;
>>> +
> 
> Claude seemed quite confident about removal of cpus_read_lock() in these
> functions. Sashiko is confident that this has opened up several new race
> conditions:

Domains are stored in an RCU list, so if cpus_read_lock() is not possible they can also be
accessed from an RCU read-side critical section, for which x86 does not have an example at
this time. This would look something like:
	rcu_read_lock();
	list_for_each_entry_rcu() {
	}
	rcu_read_unlock();

The definition of domain_list_lock within arch/x86/kernel/cpu/resctrl/core.c has a nice
writeup created by James about the locking requirements.

I do not think cpus_read_lock() should be removed. Especially not since my suggestion is
to actually traverse the domain list using get_mon_domain_from_cpu() that will complain
loudly if the CPU hotplug lock is not held.

> 
> https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com
> 
> Maybe we need to add a reference count to the rdt_l3_mon_domain
> structure and delay freeing it until the last user is gone?

I still think that using get_mon_domain_from_cpu() in the workers could work here.
Here is the idea more specifically for the MBM overflow handler:

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 88f1fa0b9d8d..7186d6d02d6e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
-	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
+	d = get_mon_domain_from_cpu(smp_processor_id(), r);
+	if (!d)
+		goto out_unlock;
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp);

update_mba_bw() is run from mbm_handle_overflow() and uses this same pattern to get the
MBA control domain, which I take as support that this pattern is safe in this context.

Reinette


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-05  4:39     ` Reinette Chatre
@ 2026-05-05 16:45       ` Luck, Tony
  2026-05-05 21:26         ` Reinette Chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2026-05-05 16:45 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/4/26 3:50 PM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
> >> On 5/1/26 2:36 PM, Tony Luck wrote:
> 
> >>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >>> index 9fd901c78dc6..e68eec83306e 100644
> >>> --- a/fs/resctrl/monitor.c
> >>> +++ b/fs/resctrl/monitor.c
> >>> @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
> >>>  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> >>>  	struct rdt_l3_mon_domain *d;
> >>>  
> >>> -	cpus_read_lock();
> >>>  	mutex_lock(&rdtgroup_mutex);
> >>>  
> >>>  	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
> >>
> >> Since work always runs on a CPU belonging to the domain, could it be simpler to use
> >> get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
> >> instead of the work contained in the domain struct? 
> > 
> > Is this true? When a CPU is taken offline Linux picks another CPU to run
> 
> These are workers supporting monitoring and they read RMIDs from where they are 
> running. They do not need to IPI a CPU in another domain to read the monitoring data
> of interest. Are you seeing this behave differently?
> 
> Theoretically resctrl could have workers run anywhere if the events being handled
> don't care which CPU the monitoring data is read from, but the current workers
> do not behave this way.
> 
> > any unexpired queued work. No guarantee that new CPU is in the same
> > domain. I think a robust solution is going to need a check that the
> 
> There are a couple of places where the work is scheduled. I understand the
> particular scenario you refer to to be the work done by resctrl_offline_cpu().
> Specifically,
> 
> 	resctrl_offline_cpu(unsigned int cpu /* CPU being offlined */)
> 	{
> 		struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> 		struct rdt_l3_mon_domain *d;
> 		...
> 		
> 		d = get_mon_domain_from_cpu(cpu, l3); /* d is domain to which CPU being offlined belongs */
> 		if (d) {
> 			if (/* overflow handler currently running on CPU being offlined */) {
> 				cancel_delayed_work(&d->mbm_over);
> 				mbm_setup_overflow_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
> 			}
> 			if (/* limbo handler currently running on CPU being offlined */) {
> 				cancel_delayed_work(&d->cqm_limbo);
> 				cqm_setup_limbo_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
> 			}
> 		}
> 	}
> 
> From above I see that the new CPU being picked for the work *is* guaranteed to be in the
> same domain as the CPU being offlined. What am I missing?

Here's the scenario:

There is just one CPU left in a domain. The mbm_over worker is woken on
that CPU just as a user requests to take that CPU offline (executing on
another CPU).

There is a race.  mbm_handle_overflow() has begun execution, but the
offline process has taken cpus_write_lock() so it blocks and sleeps
waiting for cpus_read_lock().

The offline process calls:

resctrl_arch_offline_cpu()
-> resctrl_offline_cpu
    -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
    -> mbm_setup_overflow_handler(d, 0, cpu);
       Finds there are other CPUs in the domain
-> domain_remove_cpu()
   -> domain_remove_cpu_mon()
      clears bit for this CPU from hdr->cpu_mask and finds mask is empty
      -> resctrl_offline_mon_domain()
         -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
         -> domain_destroy_l3_mon_state()
            -> kfree(d->mbm_states[idx]); /* file system state */
      -> removes domain from L3 resource domain list
      -> l3_mon_domain_free()
         kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
         kfree(hw_dom)

The offline process continues. One of the things it does is check for
orphans on the run queue of the now-deceased CPU and adopt them.

Finally the offline process completes and does cpus_write_unlock()

Now mbm_handle_overflow() can continue. But it is on the wrong CPU
and has a pointer to a work_struct that was freed by the offline
process.

> > delayed work handlers are on the right domain. It looks like existing
> > code doesn't handle this well.
> 
> The work is also scheduled from resctrl_online_mon_domain() and then re-scheduled from
> the workers themselves, cqm_handle_limbo() and mbm_handle_overflow(). In all cases the
> workers stay in their respective domains.
> 
> Could you please elaborate how you find that the existing code does not handle this well?
> >> This seems to more closely match the pattern used in rdtgroup_mondata_show() that
> >> stores the domain ID in its state instead of a pointer to the domain and then uses
> >> resctrl_find_domain() to find domain.
> >>
> >>>  
> >>> +	/*  If this domain is being deleted this work no longer needs to run. */
> >>> +	if (d->offlining)
> >>> +		goto out_unlock;
> >>> +
> > 
> > Claude seemed quite confident about removal of cpus_read_lock() in these
> > functions. Sashiko is confident that this has opened up several new race
> > conditions:
> 
> Domains are stored in a RCU list so if cpus_read_lock() is not possible they can also be
> accessed from a RCU read-side critical section for which x86 do not have an example at
> this time. This would look something like:
> 	rcu_read_lock();
> 	list_for_each_entry_rcu() {
> 	}
> 	rcu_read_unlock();
> 
> The definition of domain_list_lock within arch/x86/kernel/cpu/resctrl/core.c has a nice
> writeup created by James about the locking requirements.
> 
> I do not think cpus_read_lock() should be removed. Especially not since my suggestion is
> to actually traverse the domain list using get_mon_domain_from_cpu() that will complain
> loudly if the CPU hotplug lock is not held.

Agreed. This was a bad choice by Claude, and I fell for its reasoning.

> > 
> > https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com
> > 
> > Maybe we need to add a reference count to the rdt_l3_mon_domain
> > structure and delay freeing it until the last user is gone?
> 
> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> Here is the idea more specifically for the MBM overflow handler:
> 
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 88f1fa0b9d8d..7186d6d02d6e 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
>  		goto out_unlock;
>  
>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
> +	if (!d)
> +		goto out_unlock;

This will get a domain, but it will be for the wrong CPU.

Maybe it doesn't hurt for it to perform an extra set of mbm_update()
calls on that CPU?

It will then do:

	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
						   RESCTRL_PICK_ANY_CPU);
	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);

This "wrong" domain already has a worker ... Will this just reset the
timeout to the new "delay" value? Possibly also to a different CPU?

>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
>  		mbm_update(r, d, prgrp);
> 
> update_mba_bw() is run from mbm_handle_overflow() and uses this same pattern
> to get the MBA control domain which I find to support that the context is safe.
> 
> Reinette

-Tony

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-05 16:45       ` Luck, Tony
@ 2026-05-05 21:26         ` Reinette Chatre
  2026-05-05 23:07           ` Luck, Tony
  0 siblings, 1 reply; 16+ messages in thread
From: Reinette Chatre @ 2026-05-05 21:26 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

Hi Tony,

On 5/5/26 9:45 AM, Luck, Tony wrote:
> On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:

... 
> Here's the scenario:
> 
> There is just one CPU left in a domain. The mbm_over worker is woken on
> that CPU just as user requests to take that CPU offline (executing on
> another CPU).
> 
> There is a race.  mbm_handle_overflow() has begun execution, but the
> offline process has taken cpus_write_lock() so it blocks and sleeps
> waiting for cpus_read_lock().
> 
> The offline process calls:
> 
> resctrl_arch_offline_cpu()
> -> resctrl_offline_cpu
>     -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */

Here I expect d->mbm_work_cpu to be set to this one CPU that is left in the domain.

>     -> mbm_setup_overflow_handler(d, 0, cpu);
>        Finds there are other CPUs in the domain

I do not think that this will find other CPUs in the domain here since
the scenario starts with there being only one CPU left in the domain and it is
currently running the overflow handler. From what I can tell, in this scenario,
mbm_setup_overflow_handler() will return without scheduling the overflow handler
on any CPU.

After mbm_setup_overflow_handler() returns I expect d->mbm_work_cpu to be set to 
nr_cpu_ids.

This detail does not impact the race you are highlighting though.

> -> domain_remove_cpu()
>    -> domain_remove_cpu_mon()
>       clears bit for this CPU from hdr->cpu_mask and finds mask is empty
>       -> resctrl_offline_mon_domain()
>          -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
>          -> domain_destroy_l3_mon_state()
>             -> kfree(d->mbm_states[idx]); /* file system state */
>       -> removes domain from L3 resource domain list
>       -> l3_mon_domain_free()
>          kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
>          kfree(hw_dom)
> 
> Offline process continues. One of the things it does is checks for
> orphans on the run queue of the now deceased CPU and adopts them.
> 
> Finally the offline process completes and does cpus_write_unlock()
> 
> Now mbm_handle_overflow() can continue. But it is on the wrong CPU
> and has a pointer to a work_struct that was freed by the offline
> process.

Thank you for the explanation, I did not consider this scenario. From what I understand
folks encountering this scenario should also see a splat from the MBM overflow
handler from all the places where it runs smp_processor_id(). Well, only if these
people are running with CONFIG_DEBUG_PREEMPT enabled, because of the
debug_smp_processor_id()->check_preemption_disabled()->is_percpu_thread() check.

...

>> I still think that using get_mon_domain_from_cpu() in the workers could work here.
>> Here is the idea more specifically for the MBM overflow handler:
>>
>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
>> index 88f1fa0b9d8d..7186d6d02d6e 100644
>> --- a/fs/resctrl/monitor.c
>> +++ b/fs/resctrl/monitor.c
>> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
>>  		goto out_unlock;
>>  
>>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
>> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
>> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
>> +	if (!d)
>> +		goto out_unlock;
> 
> This will get a domain, but it will be for the wrong CPU.

If it was migrated yes. From what I understand when this happens it stops being
a "per-CPU thread" so how about something like:

	mbm_handle_overflow()
	{

		...
		cpus_read_lock();
		mutex_lock(&rdtgroup_mutex);
		
		/*
		 * Overflow handler migrated during race while CPU went offline and
		 * no longer running on intended CPU.
		 */
		if (!is_percpu_thread())
			goto out_unlock;

		...
	out_unlock:
		mutex_unlock(&rdtgroup_mutex);
		cpus_read_unlock();
	}


I am not clear on whether there is more than one race here now. From the flow
you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
it can assume that its associated domain has not been removed and thus that it is
running with a valid work_struct? More specifically, if is_percpu_thread() is true
then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work? 
I am trying to match this race involving the CPU hotplug lock with the race described
in original changelog that involves rdtgroup_mutex here ...

> 
> Maybe it doesn't hurt for it to perform an extra set of mbm_update()
> calls on that CPU?

The extra mbm_update() calls seem ok. An extra call to limbo handler should be ok also.
One possible issue is the impact on software controller that assumes it is being
called once per second. Looks like an extra call can be avoided though?

> 
> It will then do:
> 
> 	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
> 						   RESCTRL_PICK_ANY_CPU);
> 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
> 
> This "wrong" domain already has a worker ... Will this just reset the
> timeout to the new "delay" value? Possibly also to a different CPU?

queue_delayed_work_on()'s comments mention "Return: %false if @work was
already on a queue" which I interpret as existing (correct) worker not being
impacted. I am not familiar with these corner cases though.

Reinette


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-05 21:26         ` Reinette Chatre
@ 2026-05-05 23:07           ` Luck, Tony
  2026-05-06 18:24             ` Reinette Chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2026-05-05 23:07 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

On Tue, May 05, 2026 at 02:26:11PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/5/26 9:45 AM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
> 
> ... 
> > Here's the scenario:
> > 
> > There is just one CPU left in a domain. The mbm_over worker is woken on
> > that CPU just as user requests to take that CPU offline (executing on
> > another CPU).
> > 
> > There is a race.  mbm_handle_overflow() has begun execution, but the
> > offline process has taken cpus_write_lock() so it blocks and sleeps
> > waiting for cpus_read_lock().
> > 
> > The offline process calls:
> > 
> > resctrl_arch_offline_cpu()
> > -> resctrl_offline_cpu
> >     -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
> 
> Here I expect d->mbm_work_cpu to be set to this one CPU that is left in the domain.
> 
> >     -> mbm_setup_overflow_handler(d, 0, cpu);
> >        Finds there are other CPUs in the domain
> 
> I do not think that this will find other CPUs in the domain here since
> the scenario starts with there being only one CPU left in the domain and it is
> currently running the overflow handler. From what I can tell, in this scenario,
> mbm_setup_overflow_handler() will return without scheduling the overflow handler
> on any CPU.
> 
> After mbm_setup_overflow_handler() returns I expect d->mbm_work_cpu to be set to 
> nr_cpu_ids.
> 
> This detail does not impact the race you are highlighting though.
> 
> > -> domain_remove_cpu()
> >    -> domain_remove_cpu_mon()
> >       clears bit for this CPU from hdr->cpu_mask and finds mask is empty
> >       -> resctrl_offline_mon_domain()
> >          -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
> >          -> domain_destroy_l3_mon_state()
> >             -> kfree(d->mbm_states[idx]); /* file system state */
> >       -> removes domain from L3 resource domain list
> >       -> l3_mon_domain_free()
> >          kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
> >          kfree(hw_dom)
> > 
> > Offline process continues. One of the things it does is checks for
> > orphans on the run queue of the now deceased CPU and adopts them.
> > 
> > Finally the offline process completes and does cpus_write_unlock()
> > 
> > Now mbm_handle_overflow() can continue. But it is on the wrong CPU
> > and has a pointer to a work_struct that was freed by the offline
> > process.
> 
> Thank you for the explanation, I did not consider this scenario. From what I understand
> folks encountering this scenario should also see a splat from the MBM overflow
> handler from all the places where it runs smp_processor_id(). Well, only if these
> people are running with CONFIG_DEBUG_PREEMPT enabled, because of the
> debug_smp_processor_id()->check_preemption_disabled()->is_percpu_thread() check.
> 
> ...
> 
> >> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> >> Here is the idea more specifically for the MBM overflow handler:
> >>
> >> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >> index 88f1fa0b9d8d..7186d6d02d6e 100644
> >> --- a/fs/resctrl/monitor.c
> >> +++ b/fs/resctrl/monitor.c
> >> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
> >>  		goto out_unlock;
> >>  
> >>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> >> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> >> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
> >> +	if (!d)
> >> +		goto out_unlock;
> > 
> > This will get a domain, but it will be for the wrong CPU.
> 
> If it was migrated yes. From what I understand when this happens it stops being
> a "per-CPU thread" so how about something like:
> 
> 	mbm_handle_overflow()
> 	{
> 
> 		...
> 		cpus_read_lock();
> 		mutex_lock(&rdtgroup_mutex);
> 		
> 		/*
> 		 * Overflow handler migrated during race while CPU went offline and
> 		 * no longer running on intended CPU.
> 		 */
> 		if (!is_percpu_thread())
> 			goto unlock;

This is neat. It works well in the case with the above race on the last
CPU in a domain going offline. But I think there are problems if there
are other CPUs available, and user takes d->mbm_work_cpu or d->cqm_work_cpu
offline.

Here the corner case semantics of repeated calls to schedule_delayed_work_on()
are important. In this case resctrl_offline_cpu() will call:

	cancel_delayed_work(&d->mbm_over); // Not sync, so running WQ keeps going
	mbm_setup_overflow_handler(d, 0, cpu);
		There are other CPUs available in the domain. Picks one:
		dom->mbm_work_cpu = cpu;
		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);

			This sees that work is already running and returns
			%false without doing anything.

When offline of the CPU completes, the worker runs, finds it is no
longer an "is_percpu_thread()" and returns without scheduling future
execution. So checking for overflow on this domain is disabled.

> 		...
> 	out_unlock:
> 		mutex_unlock(&rdtgroup_mutex);
> 		cpus_read_unlock();
> 	}
> 
> 
> I am not clear on whether there are more than one race here now. From the flow
> you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
> it can assume that its associated domain has not been removed and thus that it is
> running with a valid work_struct? More specifically, if is_percpu_thread() is true
> then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work? 
> I am trying to match this race involving the CPU hotplug lock with the race described
> in original changelog that involves rdtgroup_mutex here ...
> 
> > 
> > Maybe it doesn't hurt for it to perform an extra set of mbm_update()
> > calls on that CPU?
> 
> The extra mbm_update() calls seem ok. An extra call to limbo handler should be ok also.
> One possible issue is the impact on software controller that assumes it is being
> called once per second. Looks like an extra call can be avoided though?
> 
> > 
> > It will then do:
> > 
> > 	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
> > 						   RESCTRL_PICK_ANY_CPU);
> > 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
> > 
> > This "wrong" domain already has a worker ... Will this just reset the
> > timeout to the new "delay" value? Possibly also to a different CPU?
> 
> queue_delayed_work_on()'s comments mention "Return: %false if @work was
> already on a queue" which I interpret as existing (correct) worker not being
> impacted. I am not familiar with these corner cases though.

Yes. Existing worker is not impacted. Which causes the problem described above.

> Reinette

-Tony


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-05 23:07           ` Luck, Tony
@ 2026-05-06 18:24             ` Reinette Chatre
  2026-05-06 19:48               ` Luck, Tony
  2026-05-06 20:02               ` Luck, Tony
  0 siblings, 2 replies; 16+ messages in thread
From: Reinette Chatre @ 2026-05-06 18:24 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

Hi Tony,

On 5/5/26 4:07 PM, Luck, Tony wrote:
> On Tue, May 05, 2026 at 02:26:11PM -0700, Reinette Chatre wrote:
>> On 5/5/26 9:45 AM, Luck, Tony wrote:
>>> On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:


>>>> I still think that using get_mon_domain_from_cpu() in the workers could work here.
>>>> Here is the idea more specifically for the MBM overflow handler:
>>>>
>>>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
>>>> index 88f1fa0b9d8d..7186d6d02d6e 100644
>>>> --- a/fs/resctrl/monitor.c
>>>> +++ b/fs/resctrl/monitor.c
>>>> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
>>>>  		goto out_unlock;
>>>>  
>>>>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
>>>> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
>>>> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
>>>> +	if (!d)
>>>> +		goto out_unlock;
>>>
>>> This will get a domain, but it will be for the wrong CPU.
>>
>> If it was migrated yes. From what I understand when this happens it stops being
>> a "per-CPU thread" so how about something like:
>>
>> 	mbm_handle_overflow()
>> 	{
>>
>> 		...
>> 		cpus_read_lock();
>> 		mutex_lock(&rdtgroup_mutex);
>> 		
>> 		/*
>> 		 * Overflow handler migrated during race while CPU went offline and
>> 		 * no longer running on intended CPU.
>> 		 */
>> 		if (!is_percpu_thread())
>> 			goto out_unlock;
> 
> This is neat. It works well in the case with the above race on the last
> CPU in a domain going offline. But I think there are problems if there
> are other CPUs available, and user takes d->mbm_work_cpu or d->cqm_work_cpu
> offline.
> 
> Here the corner case semantics of repeated calls to schedule_delayed_work_on()

(I ended up creating a small test module to confirm behaviors discussed here)

> are important. In this case resctrl_offline_cpu() will call:
> 
> 	cancel_delayed_work(&d->mbm_over); // Not sync, so running WQ keeps going

Right. cancel_delayed_work() returns "false" in this particular scenario where the
worker is running but blocked on cpus_read_lock(). Since returning false in this case
is described in cancel_delayed_work()'s function comments I believe that it can be relied
upon as a check whether worker is running. There is also "work_busy()" that can be used as
extra confirmation that workload is actively running with example usage in
bpf_async_cb_rcu_tasks_trace_free() but that does not seem to be necessary in this
scenario.


> 	mbm_setup_overflow_handler(d, 0, cpu);
> 		There are other CPUs available in the domain. Picks one:
> 		dom->mbm_work_cpu = cpu;
> 		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> 
> 			This sees that work is already running and returns
> 			%false without doing anything.

This claim ("returns %false without doing anything") does not match actual behavior.
Experiment showed that the work is indeed scheduled but interestingly it is not
scheduled on the intended CPU per "cpu" parameter to schedule_delayed_work_on() but
instead it is scheduled on the CPU on which the worker is currently running and blocked.
This seems to be a feature of workqueue.

> 
> When offline of the CPU completes, the worker runs, finds it is no
> longer an "is_percpu_thread()" and returns without scheduling future
> execution. So checking for overflow on this domain is disabled.

...

> 
>> 		...
>> 	out_unlock:
>> 		mutex_unlock(&rdtgroup_mutex);
>> 		cpus_read_unlock();
>> 	}
>>
>>
>> I am not clear on whether there are more than one race here now. From the flow
>> you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
>> it can assume that its associated domain has not been removed and thus that it is
>> running with a valid work_struct? More specifically, if is_percpu_thread() is true
>> then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work? 
>> I am trying to match this race involving the CPU hotplug lock with the race described
>> in original changelog that involves rdtgroup_mutex here ...
>>
>>>
>>> Maybe it doesn't hurt for it to perform an extra set of mbm_update()
>>> calls on that CPU?
>>
>> The extra mbm_update() calls seem ok. An extra call to limbo handler should be ok also.
>> One possible issue is the impact on software controller that assumes it is being
>> called once per second. Looks like an extra call can be avoided though?
>>
>>>
>>> It will then do:
>>>
>>> 	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
>>> 						   RESCTRL_PICK_ANY_CPU);
>>> 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
>>>
>>> This "wrong" domain already has a worker ... Will this just reset the
>>> timeout to the new "delay" value? Possibly also to a different CPU?
>>
>> queue_delayed_work_on()'s comments mention "Return: %false if @work was
>> already on a queue" which I interpret as existing (correct) worker not being
>> impacted. I am not familiar with these corner cases though.
> 
> Yes. Existing worker is not impacted. Which causes the problem described above.

schedule_delayed_work_on() will schedule the work but will do so on the CPU going
offline. It does not seem as though schedule_delayed_work_on() should be used at all
if the worker is currently running. As an alternative, when it finds that it cannot
cancel the work, resctrl can avoid attempting to reschedule the work and instead just
set rdt_l3_mon_domain::mbm_work_cpu to nr_cpu_ids to signal that this domain needs a
worker to be scheduled, and that this is to be done by the exiting work.

Combining the previous ideas with the results from experiments, I think the following
may address the problem for the MBM overflow handler; it is not expanded to include the
limbo handler and is untested:

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9fd901c78dc6..2e54042b7ee9 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -852,6 +852,30 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+
+	/*
+	 * Worker was blocked waiting for the CPU it was running on to go
+	 * offline. Handle two scenarios:
+	 * - Worker was running on the last CPU of a domain. The domain and
+	 *   thus the work_struct has been freed so do not attempt to obtain
+	 *   domain via container_of(). All remaining domains have overflow
+	 *   handlers so the loop will not find any domains needing an
+	 *   overflow handler. Just exit.
+	 * - Worker was running on CPU that just went offline with other
+	 *   CPUs in domain still running and available to take over the
+	 *   worker. Offline handler could not schedule a new worker on
+	 *   another CPU in the domain but signaled that this needs to be
+	 *   done by setting mbm_work_cpu to nr_cpu_ids. Find the domain
+	 *   that needs a worker and schedule it now.
+	 */
+	if (!is_percpu_thread()) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+			if (d->mbm_work_cpu == nr_cpu_ids)
+				mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU);
+		}
+		goto out_unlock;
+	}
+
 	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 02f87c4bc03c..cc8620ace7ed 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4539,8 +4539,19 @@ void resctrl_offline_cpu(unsigned int cpu)
 	d = get_mon_domain_from_cpu(cpu, l3);
 	if (d) {
 		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
-			cancel_delayed_work(&d->mbm_over);
-			mbm_setup_overflow_handler(d, 0, cpu);
+			if (cancel_delayed_work(&d->mbm_over)) {
+				mbm_setup_overflow_handler(d, 0, cpu);
+			} else {
+				/*
+				 * Unable to schedule work on new CPU if it
+				 * is currently running since the re-schedule
+				 * will just force new work to run on
+				 * current CPU. Mark domain's worker as
+				 * needing to be rescheduled to be handled
+				 * by worker itself.
+				 */
+				d->mbm_work_cpu = nr_cpu_ids;
+			}
 		}
 		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
 		    cpu == d->cqm_work_cpu && has_busy_rmid(d)) {





* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 18:24             ` Reinette Chatre
@ 2026-05-06 19:48               ` Luck, Tony
  2026-05-06 21:45                 ` Reinette Chatre
  2026-05-06 20:02               ` Luck, Tony
  1 sibling, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2026-05-06 19:48 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

On Wed, May 06, 2026 at 11:24:30AM -0700, Reinette Chatre wrote:

... trimmed discussion on how we got here ...

> schedule_delayed_work_on() will schedule the work but will do so on CPU going
> offline. Does not seem as though schedule_delayed_work_on() should be used at all
> if the worker is currently running. As an alternative, when it finds that it cannot
> cancel the work resctrl can avoid attempting to reschedule the work and instead just
> set rdt_l3_mon_domain::mbm_work_cpu to nr_cpu_ids to signal that this domain needs a
> worker to be scheduled and that to be done by the exiting work.
> 
> Combining the previous ideas with the results from experiments, I think the following
> may address the problem for the MBM overflow handler; it is not expanded to include the
> limbo handler and is untested:

Initial testing seems good. I added a big mdelay() in mbm_handle_overflow() 
before cpus_read_lock() to make it easy to hit the case where cancel_delayed_work()
fails. Tested both the "still have remaining CPUs in the domain" and "this is the
last CPU" cases for both success and failure of cancel_delayed_work().

It looks to me that resctrl_offline_cpu() handles this completely and
the additional cancel_delayed_work() calls from resctrl_offline_mon_domain()
aren't needed.

Do you agree that those can be deleted?

I'll look at fixing the cqm_limbo path in the same style.

> 
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 9fd901c78dc6..2e54042b7ee9 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -852,6 +852,30 @@ void mbm_handle_overflow(struct work_struct *work)
>  		goto out_unlock;
>  
>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> +
> +	/*
> +	 * Worker was blocked waiting for the CPU it was running on to go
> +	 * offline. Handle two scenarios:
> +	 * - Worker was running on the last CPU of a domain. The domain and
> +	 *   thus the work_struct has been freed so do not attempt to obtain
> +	 *   domain via container_of(). All remaining domains have overflow
> +	 *   handlers so the loop will not find any domains needing an
> +	 *   overflow handler. Just exit.
> +	 * - Worker was running on CPU that just went offline with other
> +	 *   CPUs in domain still running and available to take over the
> +	 *   worker. Offline handler could not schedule a new worker on
> +	 *   another CPU in the domain but signaled that this needs to be
> +	 *   done by setting mbm_work_cpu to nr_cpu_ids. Find the domain
> +	 *   that needs a worker and schedule it now.
> +	 */
> +	if (!is_percpu_thread()) {
> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +			if (d->mbm_work_cpu == nr_cpu_ids)
> +				mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU);
> +		}
> +		goto out_unlock;
> +	}
> +
>  	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
>  
>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 02f87c4bc03c..cc8620ace7ed 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -4539,8 +4539,19 @@ void resctrl_offline_cpu(unsigned int cpu)
>  	d = get_mon_domain_from_cpu(cpu, l3);
>  	if (d) {
>  		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> -			cancel_delayed_work(&d->mbm_over);
> -			mbm_setup_overflow_handler(d, 0, cpu);
> +			if (cancel_delayed_work(&d->mbm_over)) {
> +				mbm_setup_overflow_handler(d, 0, cpu);
> +			} else {
> +				/*
> +				 * Unable to schedule work on new CPU if it
> +				 * is currently running since the re-schedule
> +				 * will just force new work to run on
> +				 * current CPU. Mark domain's worker as
> +				 * needing to be rescheduled to be handled
> +				 * by worker itself.
> +				 */
> +				d->mbm_work_cpu = nr_cpu_ids;
> +			}
>  		}
>  		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
>  		    cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
> 
> 

-Tony


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 18:24             ` Reinette Chatre
  2026-05-06 19:48               ` Luck, Tony
@ 2026-05-06 20:02               ` Luck, Tony
  2026-05-06 20:33                 ` Reinette Chatre
  1 sibling, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2026-05-06 20:02 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

Question?

> +	if (!is_percpu_thread()) {
> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +			if (d->mbm_work_cpu == nr_cpu_ids)
> +				mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU);

Should that "MBM_OVERFLOW_INTERVAL" be "0"? This worker is presumably
already slightly late because of the offline CPU overhead and time to
be picked up by another CPU. Maybe it should run right away on whatever
new CPU in the domain is picked?

> +		}
> +		goto out_unlock;
> +	}
> +

-Tony


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 20:02               ` Luck, Tony
@ 2026-05-06 20:33                 ` Reinette Chatre
  2026-05-06 20:52                   ` Luck, Tony
  0 siblings, 1 reply; 16+ messages in thread
From: Reinette Chatre @ 2026-05-06 20:33 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

Hi Tony,

On 5/6/26 1:02 PM, Luck, Tony wrote:
> Question?
> 
>> +	if (!is_percpu_thread()) {
>> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
>> +			if (d->mbm_work_cpu == nr_cpu_ids)
>> +				mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU);
> 
> Should that "MBM_OVERFLOW_INTERVAL" be "0"? This worker is presumably
> already slightly late because of the offline CPU overhead and time to
> be picked up by another CPU. Maybe it should run right away on whatever
> new CPU in the domain is picked?

The delay is intentionally _not_ zero and there should probably be a comment
to make that clear. My module experiment demonstrated that when the work associated
with the work_struct is already running then, no matter which CPU is provided as the parameter
to schedule_delayed_work_on(), the workqueue handling will schedule the work on the same
CPU as the currently executing work. The second time around is_percpu_thread() will still be
false, but this time mbm_work_cpu will be set to the CPU the work should have been scheduled on,
so the work will exit without re-arming the worker and the associated domain loses its worker.

By setting the delay to MBM_OVERFLOW_INTERVAL it guarantees that the currently executing
worker will be done by the time the newly scheduled worker should run, and the new worker will
thus be scheduled on the correct CPU. I assume you are hinting that if memory bandwidth is
under pressure there may thus be a risk that an overflow occurred? Perhaps
MBM_OVERFLOW_INTERVAL is too big - the delay only needs to be big enough to ensure that the
current worker is done before the new worker is scheduled. Do you have suggestions?

Reinette




* RE: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 20:33                 ` Reinette Chatre
@ 2026-05-06 20:52                   ` Luck, Tony
  0 siblings, 0 replies; 16+ messages in thread
From: Luck, Tony @ 2026-05-06 20:52 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Borislav Petkov, x86@kernel.org, Fenghua Yu,
	Wieczor-Retman, Maciej, Peter Newman, James Morse, Babu Moger,
	Drew Fustini, Dave Martin, Chen, Yu C,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

> > Question?
> >
> >> +  if (!is_percpu_thread()) {
> >> +          list_for_each_entry(d, &r->mon_domains, hdr.list) {
> >> +                  if (d->mbm_work_cpu == nr_cpu_ids)
> >> +                          mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU);
> >
> > Should that "MBM_OVERFLOW_INTERVAL" be "0"? This worker is presumably
> > already slightly late because of the offline CPU overhead and time to
> > be picked up by another CPU. Maybe it should run right away on whatever
> > new CPU in the domain is picked?
>
> The delay is intentionally _not_ zero and there should probably be a comment
> to make that clear. My module experiment demonstrated that when the work associated
> with the work_struct is already running then no matter which CPU is provided as parameter
> to schedule_delayed_work_on() the workqueue handling will schedule the work on the same
> CPU as the currently executing work. Second time around is_percpu_thread() will still be
> false but this time mbm_work_cpu will be set to CPU it should have been scheduled to and
> work will exit without re-arming the worker and the associated domain loses its worker.
>
> By setting the delay to MBM_OVERFLOW_INTERVAL it guarantees that the current executing
> worker will be done by the time the newly scheduled worker should run and thus
> be scheduled on correct CPU. I assume you are hinting that if the memory bandwidth is
> under pressure there may thus be a risk that an overflow occurred? Perhaps
> MBM_OVERFLOW_INTERVAL is too big - the delay only needs to be big enough to ensure that
> current worker is done before new worker is scheduled. Do you have suggestions?

I'd missed that "0" means "ignore cpu and run this right away". I'll keep it at MBM_OVERFLOW_INTERVAL
(and use CQM_LIMBOCHECK_INTERVAL for the cqm_limbo case).

While running late is not ideal, MBM_OVERFLOW_INTERVAL was chosen conservatively
for older Intel systems with 24-bit MBM counters. A gap of twice that interval is unlikely to cause issues
on those systems. Modern (Icelake and newer) Intel systems have 32-bit counters and are
completely safe; I believe that AMD counters are also wide enough that a 2-second interval is
safe. Babu???

-Tony


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 19:48               ` Luck, Tony
@ 2026-05-06 21:45                 ` Reinette Chatre
  2026-05-06 22:11                   ` Luck, Tony
  0 siblings, 1 reply; 16+ messages in thread
From: Reinette Chatre @ 2026-05-06 21:45 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
	Chen Yu, linux-kernel, patches

Hi Tony,

On 5/6/26 12:48 PM, Luck, Tony wrote:
> On Wed, May 06, 2026 at 11:24:30AM -0700, Reinette Chatre wrote:
> 
> ... trimmed discussion on how we got here ...
> 
>> schedule_delayed_work_on() will schedule the work but will do so on CPU going
>> offline. Does not seem as though schedule_delayed_work_on() should be used at all
>> if the worker is currently running. As an alternative, when it finds that it cannot
>> cancel the work resctrl can avoid attempting to reschedule the work and instead just
>> set rdt_l3_mon_domain::mbm_work_cpu to nr_cpu_ids to signal that this domain needs a
>> worker to be scheduled and that to be done by the exiting work.
>>
>> Combining the previous ideas with the results from experiments I think the following
>> may address the problem for MBM overflow handler, not expanded to include limbo handler
>> and untested:
> 
> Initial testing seems good. I added a big mdelay() in mbm_handle_overflow() 
> before cpus_read_lock() to make it easy to hit the case where cancel_delayed_work()
> fails. Tested both the "still have remaining CPUs in the domain" and "this is 
> last cpu" case for both success and fail of cancel_delayed_work().

Thank you very much for the testing.

> 
> It looks to me that resctrl_offline_cpu() handles this completely and
> the additional cancel_delayed_work() calls from resctrl_offline_mon_domain()
> aren't needed.
> 
> Do you agree that those can be deleted?

Good catch. I am not able to think of a scenario where this is still needed. The new
flow opens up some new scenarios, for example when the last *two* CPUs of a domain go
offline while the worker is blocked on cpus_read_lock() and the worker does not get an
opportunity to transition. Even then, when the MBM overflow handling code in resctrl_offline_cpu()
is totally skipped for one CPU, the cancel_delayed_work() in resctrl_offline_mon_domain()
seems unnecessary to me.

Could you perhaps ask the AI agent that assisted with original patch if it can find any
corner cases?

Unrelated to this question, but it may be worth a mention in the fix that this work focuses
on fixing resctrl so that the worker itself does not access freed memory. To complement this it may
be worthwhile to highlight that it is safe for the work_struct itself to be deleted while the
work is running (but blocked on cpus_read_lock()), based on the following comment from
kernel/workqueue.c:process_one_work():
"It is permissible to free the struct work_struct from inside the function that is called
from it ..."

> 
> I'll look at fixing the cqm_limbo path in the same style.

Thank you very much.

Reinette



* RE: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 21:45                 ` Reinette Chatre
@ 2026-05-06 22:11                   ` Luck, Tony
  2026-05-06 22:28                     ` Reinette Chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Luck, Tony @ 2026-05-06 22:11 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Borislav Petkov, x86@kernel.org, Fenghua Yu,
	Wieczor-Retman, Maciej, Peter Newman, James Morse, Babu Moger,
	Drew Fustini, Dave Martin, Chen, Yu C,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

> > ... trimmed discussion on how we got here ...
> >
> >> schedule_delayed_work_on() will schedule the work but will do so on CPU going
> >> offline. Does not seem as though schedule_delayed_work_on() should be used at all
> >> if the worker is currently running. As an alternative, when it finds that it cannot
> >> cancel the work resctrl can avoid attempting to reschedule the work and instead just
> >> set rdt_l3_mon_domain::mbm_work_cpu to nr_cpu_ids to signal that this domain needs a
> >> worker to be scheduled and that to be done by the exiting work.
> >>
> >> Combining the previous ideas with the results from experiments I think the following
> >> may address the problem for MBM overflow handler, not expanded to include limbo handler
> >> and untested:
> >
> > Initial testing seems good. I added a big mdelay() in mbm_handle_overflow()
> > before cpus_read_lock() to make it easy to hit the case where cancel_delayed_work()
> > fails. Tested both the "still have remaining CPUs in the domain" and "this is
> > last cpu" case for both success and fail of cancel_delayed_work().
>
> Thank you very much for the testing.
>
> >
> > It looks to me that resctrl_offline_cpu() handles this completely and
> > the additional cancel_delayed_work() calls from resctrl_offline_mon_domain()
> > aren't needed.
> >
> > Do you agree that those can be deleted?
>
> Good catch. I am not able to think of a scenario where this is still needed. The new
> flow opens up some new scenarios, for example when the last *two* CPUs of a domain go
> offline while the worker is blocked on cpus_read_lock() and worker not getting opportunity
> to transition. Even then, when the MBM overflow handling code in resctrl_offline_cpu()
> is totally skipped for one CPU the cancel_delayed_work() in resctrl_offline_mon_domain()
> seems unnecessary to me.
>
> Could you perhaps ask the AI agent that assisted with original patch if it can find any
> corner cases?

I've got a couple of AI's spinning now. I'll resume context for the one that offered the v1
patch and ask it what it thinks of this new version.

>
> Unrelated to this question but may be worth a mention in the fix is that this work focuses
> and fixes resctrl to not access freed memory from the worker itself. To complement this it may
> be worthwhile to highlight that it is safe for the work_struct itself to be deleted while the
> work is running (but blocked on cpus_read_lock()) based on the following comment from
> kernel/workqueue.c:process_one_work():
> "It is permissible to free the struct work_struct from inside the function that is called
> from it ..."

Scope increased from just the use-after-free when the domain was deleted. The case
of taking the current worker CPU offline doesn't involve a use-after-free. It just results
in running the worker on the wrong CPU for one iteration.

Deleting the work_struct inside the called function is different from some agent deleting
the work_struct while the worker is running.

>
> >
> > I'll look at fixing the cqm_limbo path in the same style.

-Tony


* Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 22:11                   ` Luck, Tony
@ 2026-05-06 22:28                     ` Reinette Chatre
  2026-05-06 23:14                       ` Luck, Tony
  0 siblings, 1 reply; 16+ messages in thread
From: Reinette Chatre @ 2026-05-06 22:28 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86@kernel.org, Fenghua Yu,
	Wieczor-Retman, Maciej, Peter Newman, James Morse, Babu Moger,
	Drew Fustini, Dave Martin, Chen, Yu C,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 5/6/26 3:11 PM, Luck, Tony wrote:
>> Unrelated to this question but may be worth a mention in the fix is that this work focuses
>> and fixes resctrl to not access freed memory from the worker itself. To complement this it may
>> be worthwhile to highlight that it is safe for the work_struct itself to be deleted while the
>> work is running (but blocked on cpus_read_lock()) based on the following comment from
>> kernel/workqueue.c:process_one_work():
>> "It is permissible to free the struct work_struct from inside the function that is called
>> from it ..."
> 
> Scope increased from just the use-after-free when the domain was deleted. The case
> for taking the current worker CPU offline doesn't involve a use-after-free. It just results
> in running the worker on the wrong CPU for one iteration.
> 
> Deleting the work_struct inside the called function is different from some agent deleting
> the work_struct while the worker is running.

Right. I interpret this to mean that judging the safety of work_struct removal should consider not
only the workqueue API itself but also external agents that may access the work_struct after its
removal. The current fix addresses access to the removed work_struct from within the worker itself, while I
interpret the workqueue API to guarantee that there will be no access to the work_struct during or
after worker execution. The fix under development thus makes it possible to safely remove the
domain even if a worker belonging to it is executing and blocked on cpus_read_lock(). Do you
see any remaining issues here?

Reinette


* RE: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
  2026-05-06 22:28                     ` Reinette Chatre
@ 2026-05-06 23:14                       ` Luck, Tony
  0 siblings, 0 replies; 16+ messages in thread
From: Luck, Tony @ 2026-05-06 23:14 UTC (permalink / raw)
  To: Chatre, Reinette
  Cc: Borislav Petkov, x86@kernel.org, Fenghua Yu,
	Wieczor-Retman, Maciej, Peter Newman, James Morse, Babu Moger,
	Drew Fustini, Dave Martin, Chen, Yu C,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

> >> Unrelated to this question but may be worth a mention in the fix is that this work focuses
> >> and fixes resctrl to not access freed memory from the worker itself. To complement this it may
> >> be worthwhile to highlight that it is safe for the work_struct itself to be deleted while the
> >> work is running (but blocked on cpus_read_lock()) based on the following comment from
> >> kernel/workqueue.c:process_one_work():
> >> "It is permissible to free the struct work_struct from inside the function that is called
> >> from it ..."
> >
> > Scope increased from just the use-after-free when the domain was deleted. The case
> > for taking the current worker CPU offline doesn't involve a use-after-free. It just results
> > in running the worker on the wrong CPU for one iteration.
> >
> > Deleting the work_struct inside the called function is different from some agent deleting
> > the work_struct while the worker is running.
>
> Right. I interpret this to mean that judging the safety of work_struct removal should consider not
> only the workqueue API itself but also external agents that may access the work_struct after its
> removal. The current fix addresses access to removed work_struct from within worker itself while I
> interpret the workqueue API to guarantee that there will be no access to work_struct during or
> after worker execution. The fix under development thus makes it possible to safely remove the
> domain even if a worker belonging to it is executing and blocked on cpus_read_lock(). Do you
> see any remaining issues here?

OK. I'll add something to the commit message.

I asked my original AI about this fix. It claimed to find problems relating to the kernel using the work_struct
after return from the function. Pasting in the comment you gave me from process_one_work() about
it being OK to free the work_struct made it reconsider and retract.

Another AI (using a copy of the sashiko rules) has found an issue with our reliance on is_percpu_thread().

The problem is the ordering of hotplug callbacks.

resctrl_arch_offline_cpu() runs early because it is in the CPUHP_AP_ONLINE_DYN class. The AI claims
that cpus_write_lock() is released after running this, but before running workqueue_offline_cpu() in the
CPUHP_AP_WORKQUEUE_ONLINE class.

So our worker may obtain cpus_read_lock() while not yet having lost its is_percpu_thread() status.


-Tony


end of thread, other threads:[~2026-05-06 23:15 UTC | newest]

Thread overview: 16+ messages
2026-05-01 21:36 [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain() Tony Luck
2026-05-04 15:11 ` Reinette Chatre
2026-05-04 22:50   ` Luck, Tony
2026-05-05  4:39     ` Reinette Chatre
2026-05-05 16:45       ` Luck, Tony
2026-05-05 21:26         ` Reinette Chatre
2026-05-05 23:07           ` Luck, Tony
2026-05-06 18:24             ` Reinette Chatre
2026-05-06 19:48               ` Luck, Tony
2026-05-06 21:45                 ` Reinette Chatre
2026-05-06 22:11                   ` Luck, Tony
2026-05-06 22:28                     ` Reinette Chatre
2026-05-06 23:14                       ` Luck, Tony
2026-05-06 20:02               ` Luck, Tony
2026-05-06 20:33                 ` Reinette Chatre
2026-05-06 20:52                   ` Luck, Tony
