public inbox for linux-kernel@vger.kernel.org
From: Reinette Chatre <reinette.chatre@intel.com>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>, <x86@kernel.org>,
	Fenghua Yu <fenghuay@nvidia.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	"Drew Fustini" <dfustini@baylibre.com>,
	Dave Martin <Dave.Martin@arm.com>, Chen Yu <yu.c.chen@intel.com>,
	<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
Date: Wed, 6 May 2026 11:24:30 -0700	[thread overview]
Message-ID: <d065f7b7-daac-4e45-b7c9-69175dfb43a7@intel.com> (raw)
In-Reply-To: <afp4OkwtPpyIwAC1@agluck-desk3>

Hi Tony,

On 5/5/26 4:07 PM, Luck, Tony wrote:
> On Tue, May 05, 2026 at 02:26:11PM -0700, Reinette Chatre wrote:
>> On 5/5/26 9:45 AM, Luck, Tony wrote:
>>> On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:


>>>> I still think that using get_mon_domain_from_cpu() in the workers could work here.
>>>> Here is the idea more specifically for the MBM overflow handler:
>>>>
>>>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
>>>> index 88f1fa0b9d8d..7186d6d02d6e 100644
>>>> --- a/fs/resctrl/monitor.c
>>>> +++ b/fs/resctrl/monitor.c
>>>> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
>>>>  		goto out_unlock;
>>>>  
>>>>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
>>>> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
>>>> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
>>>> +	if (!d)
>>>> +		goto out_unlock;
>>>
>>> This will get a domain, but it will be for the wrong CPU.
>>
>> If it was migrated yes. From what I understand when this happens it stops being
>> a "per-CPU thread" so how about something like:
>>
>> 	mbm_handle_overflow()
>> 	{
>>
>> 		...
>> 		cpus_read_lock();
>> 		mutex_lock(&rdtgroup_mutex);
>> 		
>> 		/*
>> 		 * Overflow handler migrated during race while CPU went offline and
>> 		 * no longer running on intended CPU.
>> 		 */
>> 		if (!is_percpu_thread())
>> 			goto unlock;
> 
> This is neat. It works well in the case with the above race on the last
> CPU in a domain going offline. But I think there are problems if there
> are other CPUs available, and user takes d->mbm_work_cpu or d->cqm_work_cpu
> offline.
> 
> Here the corner case semantics of repeated calls to schedule_delayed_work_on()

(I ended up creating a small test module to confirm behaviors discussed here)

> are important. In this case resctrl_offline_cpu() will call:
> 
> 	cancel_delayed_work(&d->mbm_over); // Not sync, so running WQ keeps going

Right. cancel_delayed_work() returns "false" in this particular scenario, where the
worker is running but blocked on cpus_read_lock(). Since this return value is documented
in cancel_delayed_work()'s function comment I believe it can be relied upon as a check
for whether the worker is running. There is also work_busy(), which could provide extra
confirmation that the work is actively running (see the example usage in
bpf_async_cb_rcu_tasks_trace_free()), but that does not seem to be necessary in this
scenario.


> 	mbm_setup_overflow_handler(d, 0, cpu);
> 		There are other CPUs available in the domain. Picks one:
> 		dom->mbm_work_cpu = cpu;
> 		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> 
> 			This sees that work is already running and returns
> 			%false without doing anything.

This claim ("returns %false without doing anything") does not match the actual behavior.
My experiment showed that the work is indeed scheduled, but, interestingly, not on the
CPU requested via the "cpu" parameter of schedule_delayed_work_on(). Instead it is
scheduled on the CPU on which the worker is currently running and blocked. This appears
to be intentional workqueue behavior.

> 
> When offline of the CPU completes, the worker runs, finds it is no
> longer a "is_percpu_thread()" and returns without scheduling future
> execution. So checking for overflow on this domain is disabled.

...

> 
>> 		...
>> 	out_unlock:
>> 		mutex_unlock(&rdtgroup_mutex);
>> 		cpus_read_unlock();
>> 	}
>>
>>
>> I am not clear on whether there are more than one race here now. From the flow
>> you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
>> it can assume that its associated domain has not been removed and thus that it is
>> running with a valid work_struct? More specifically, if is_percpu_thread() is true
>> then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work? 
>> I am trying to match this race involving the CPU hotplug lock with the race described
>> in original changelog that involves rdtgroup_mutex here ...
>>
>>>
>>> Maybe it doesn't hurt for it to perform an extra set of mbm_update()
>>> calls on that CPU?
>>
>> The extra mbm_update() calls seem ok. An extra call to limbo handler should be ok also.
>> One possible issue is the impact on software controller that assumes it is being
>> called once per second. Looks like an extra call can be avoided though?
>>
>>>
>>> It will then do:
>>>
>>> 	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
>>> 						   RESCTRL_PICK_ANY_CPU);
>>> 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
>>>
>>> This "wrong" domain already has a worker ... Will this just reset the
>>> timeout to the new "delay" value? Possibly also to a different CPU?
>>
>> queue_delayed_work_on()'s comments mention "Return: %false if @work was
>> already on a queue" which I interpret as existing (correct) worker not being
>> impacted. I am not familiar with these corner cases though.
> 
> Yes. Existing worker is not impacted. Which causes the problem described above.

schedule_delayed_work_on() will schedule the work, but will do so on the CPU going
offline. It does not seem as though schedule_delayed_work_on() should be used at all
while the worker is currently running. As an alternative, when resctrl finds that it
cannot cancel the work it can avoid attempting to reschedule it and instead set
rdt_l3_mon_domain::mbm_work_cpu to nr_cpu_ids to signal that this domain needs a new
worker, leaving the scheduling to the exiting worker itself.

Combining the previous ideas with the results from the experiments, I think the
following may address the problem. It covers only the MBM overflow handler, is not
expanded to include the limbo handler, and is untested:

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9fd901c78dc6..2e54042b7ee9 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -852,6 +852,30 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+
+	/*
+	 * Worker was blocked waiting for the CPU it was running on to go
+	 * offline. Handle two scenarios:
+	 * - Worker was running on the last CPU of a domain. The domain and
+	 *   thus the work_struct has been freed so do not attempt to obtain
+	 *   domain via container_of(). All remaining domains have overflow
+	 *   handlers so the loop will not find any domains needing an
+	 *   overflow handler. Just exit.
+	 * - Worker was running on CPU that just went offline with other
+	 *   CPUs in domain still running and available to take over the
+	 *   worker. Offline handler could not schedule a new worker on
+	 *   another CPU in the domain but signaled that this needs to be
+	 *   done by setting mbm_work_cpu to nr_cpu_ids. Find the domain
+	 *   that needs a worker and schedule it now.
+	 */
+	if (!is_percpu_thread()) {
+		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+			if (d->mbm_work_cpu == nr_cpu_ids)
+				mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU);
+		}
+		goto out_unlock;
+	}
+
 	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 02f87c4bc03c..cc8620ace7ed 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4539,8 +4539,19 @@ void resctrl_offline_cpu(unsigned int cpu)
 	d = get_mon_domain_from_cpu(cpu, l3);
 	if (d) {
 		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
-			cancel_delayed_work(&d->mbm_over);
-			mbm_setup_overflow_handler(d, 0, cpu);
+			if (cancel_delayed_work(&d->mbm_over)) {
+				mbm_setup_overflow_handler(d, 0, cpu);
+			} else {
+				/*
+				 * Unable to schedule work on new CPU if it
+				 * is currently running since the re-schedule
+				 * will just force new work to run on
+				 * current CPU. Mark domain's worker as
+				 * needing to be rescheduled to be handled
+				 * by worker itself.
+				 */
+				d->mbm_work_cpu = nr_cpu_ids;
+			}
 		}
 		if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) &&
 		    cpu == d->cqm_work_cpu && has_busy_rmid(d)) {




Thread overview: 10+ messages
2026-05-01 21:36 [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain() Tony Luck
2026-05-04 15:11 ` Reinette Chatre
2026-05-04 22:50   ` Luck, Tony
2026-05-05  4:39     ` Reinette Chatre
2026-05-05 16:45       ` Luck, Tony
2026-05-05 21:26         ` Reinette Chatre
2026-05-05 23:07           ` Luck, Tony
2026-05-06 18:24             ` Reinette Chatre [this message]
2026-05-06 19:48               ` Luck, Tony
2026-05-06 20:02               ` Luck, Tony
