Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Luck, Tony" <tony.luck@intel.com>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: Borislav Petkov <bp@alien8.de>, <x86@kernel.org>,
	Fenghua Yu <fenghuay@nvidia.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	"Drew Fustini" <dfustini@baylibre.com>,
	Dave Martin <Dave.Martin@arm.com>, Chen Yu <yu.c.chen@intel.com>,
	<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
Date: Tue, 5 May 2026 16:07:38 -0700	[thread overview]
Message-ID: <afp4OkwtPpyIwAC1@agluck-desk3> (raw)
In-Reply-To: <7fad1d7d-c892-416e-b97a-a230fd43f2a4@intel.com>

On Tue, May 05, 2026 at 02:26:11PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/5/26 9:45 AM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
> 
> ... 
> > Here's the scenario:
> > 
> > There is just one CPU left in a domain. The mbm_over worker is woken on
> > that CPU just as user requests to take that CPU offline (executing on
> > another CPU).
> > 
> > There is a race.  mbm_handle_overflow() has begun execution, but the
> > offline process has taken cpus_write_lock() so it blocks and sleeps
> > waiting for cpus_read_lock().
> > 
> > The offline process calls:
> > 
> > resctrl_arch_offline_cpu()
> > -> resctrl_offline_cpu
> >     -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
> 
> Here I expect d->mbm_work_cpu to be set to this one CPU that is left in the domain.
> 
> >     -> mbm_setup_overflow_handler(d, 0, cpu);
> >        Finds there are other CPUs in the domain
> 
> I do not think that this will find other CPUs in the domain here since
> the scenario starts with there being only one CPU left in the domain and it is
> currently running the overflow handler . From what I can tell, in this scenario,
> mbm_setup_overflow_handler() will return without scheduling the overflow handler
> on any CPU.
> 
> After mbm_setup_overflow_handler() returns I expect d->mbm_work_cpu to be set to 
> nr_cpus_ids.
> 
> This detail does not impact the race you are highlighting though.
> 
> > -> domain_remove_cpu()
> >    -> domain_remove_cpu_mon()
> >       clears bit for this CPU from hdr->cpu_mask and finds mask is empty
> >       -> resctrl_offline_mon_domain()
> >          -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
> >          -> domain_destroy_l3_mon_state()
> >             -> kfree(d->mbm_states[idx]); /* file system state */
> >       -> removes domain from L3 resource domain list
> >       -> l3_mon_domain_free()
> >          kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
> >          kfree(hw_dom)
> > 
> > Offline process continues. One of the things it does is checks for
> > orphans on the run queue of the now deceased CPU and adopts them.
> > 
> > Finally the offline process completes and does cpus_write_unlock()
> > 
> > Now mbm_handle_overflow() can continue. But it is on the wrong CPU
> > and has a pointer to a work_struct that was freed by the offline
> > process.
> 
> Thank you for the explanation, I did not consider this scenario. From what I understand
> folks encountering this scenario should also encounter a splat from the MBM overflow
> handler from all the places where it runs smp_processor_id(). Well, actually, if these
> people are running with CONFIG_DEBUG_PREEMPT enabled because of the
> debug_smp_processor_id()->check_preemption_disabled()->is_percpu_thread() check.
> 
> ...
> 
> >> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> >> Here is the idea more specifically for the MBM overflow handler:
> >>
> >> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >> index 88f1fa0b9d8d..7186d6d02d6e 100644
> >> --- a/fs/resctrl/monitor.c
> >> +++ b/fs/resctrl/monitor.c
> >> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
> >>  		goto out_unlock;
> >>  
> >>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> >> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> >> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
> >> +	if (!d)
> >> +		goto out_unlock;
> > 
> > This will get a domain, but it will be for the wrong CPU.
> 
> If it was migrated yes. From what I understand when this happens it stops being
> a "per-CPU thread" so how about something like:
> 
> 	mbm_handle_overflow()
> 	{
> 
> 		...
> 		cpus_read_lock();
> 		mutex_lock(&rdtgroup_mutex);
> 		
> 		/*
> 		 * Overflow handler migrated during race while CPU went offline and
> 		 * no longer running on intended CPU.
> 		 */
> 		if (!is_percpu_thread())
> 			goto unlock;

This is neat. It works well in the case with the above race on the last
CPU in a domain going offline. But I think there are problems if there
are other CPUs available, and user takes d->mbm_work_cpu or d->cqm_work_cpu
offline.

Here the corner case semantics of repeated calls to schedule_delayed_work_on()
are important. In this case resctrl_offline_cpu() will call:

	cancel_delayed_work(&d->mbm_over); // Not sync, so running WQ keeps going
	mbm_setup_overflow_handler(d, 0, cpu);
		There are other CPUs available in the domain. Picks one:
		dom->mbm_work_cpu = cpu;
		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);

			This sees that work is already running and returns
			%false without doing anything.

When offline of the CPU completes, the worker runs, finds it is no
longer a "is_percpu_thread()" and returns without scheduling future
execution. So checking for overflow on this domain is disabled.

> 		...
> 	out_unlock:
> 		mutex_unlock(&rdtgroup_mutex);
> 		cpus_read_unlock();
> 	}
> 
> 
> I am not clear on whether there are more than one race here now. From the flow
> you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
> it can assume that its associated domain has not been removed and thus that it is
> running with a valid work_struct? More specifically, if is_percpu_thread() is true
> then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work? 
> I am trying to match this race involving the CPU hotplug lock with the race described
> in original changelog that involves rdtgroup_mutex here ...
> 
> > 
> > Maybe it doesn't hurt for it to perform an extra set of mbm_update()
> > calls on that CPU?
> 
> The extra mbm_update() calls seem ok. An extra call to limbo handler should be ok also.
> One possible issue is the impact on software controller that assumes it is being
> called once per second. Looks like an extra call can be avoided though?
> 
> > 
> > It will then do:
> > 
> > 	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
> > 						   RESCTRL_PICK_ANY_CPU);
> > 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
> > 
> > This "wrong" domain already has a worker ... Will this just reset the
> > timeout to the new "delay" value? Possibly also to a different CPU?
> 
> queue_delayed_work_on()'s comments mention "Return: %false if @work was
> already on a queue" which I interpret as existing (correct) worker not being
> impacted. I am not familiar with these corner cases though.

Yes. Existing worker is not impacted. Which causes the problem described above.

> Reinette

-Tony

next prev parent reply	other threads:[~2026-05-05 23:07 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-01 21:36 [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain() Tony Luck
2026-05-04 15:11 ` Reinette Chatre
2026-05-04 22:50   ` Luck, Tony
2026-05-05  4:39     ` Reinette Chatre
2026-05-05 16:45       ` Luck, Tony
2026-05-05 21:26         ` Reinette Chatre
2026-05-05 23:07           ` Luck, Tony [this message]
2026-05-06 18:24             ` Reinette Chatre
2026-05-06 19:48               ` Luck, Tony
2026-05-06 21:45                 ` Reinette Chatre
2026-05-06 22:11                   ` Luck, Tony
2026-05-06 22:28                     ` Reinette Chatre
2026-05-06 23:14                       ` Luck, Tony
2026-05-07  3:42                         ` Reinette Chatre
2026-05-07 15:12                           ` Luck, Tony
2026-05-06 20:02               ` Luck, Tony
2026-05-06 20:33                 ` Reinette Chatre
2026-05-06 20:52                   ` Luck, Tony
2026-05-07 15:48               ` Luck, Tony
2026-05-07 17:06                 ` Reinette Chatre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afp4OkwtPpyIwAC1@agluck-desk3 \
    --to=tony.luck@intel.com \
    --cc=Dave.Martin@arm.com \
    --cc=babu.moger@amd.com \
    --cc=bp@alien8.de \
    --cc=dfustini@baylibre.com \
    --cc=fenghuay@nvidia.com \
    --cc=james.morse@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maciej.wieczor-retman@intel.com \
    --cc=patches@lists.linux.dev \
    --cc=peternewman@google.com \
    --cc=reinette.chatre@intel.com \
    --cc=x86@kernel.org \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.