From: "Luck, Tony" <tony.luck@intel.com>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: Borislav Petkov <bp@alien8.de>, <x86@kernel.org>,
	Fenghua Yu <fenghuay@nvidia.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	"Drew Fustini" <dfustini@baylibre.com>,
	Dave Martin <Dave.Martin@arm.com>, Chen Yu <yu.c.chen@intel.com>,
	<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
Date: Tue, 5 May 2026 16:07:38 -0700
Message-ID: <afp4OkwtPpyIwAC1@agluck-desk3>
In-Reply-To: <7fad1d7d-c892-416e-b97a-a230fd43f2a4@intel.com>

On Tue, May 05, 2026 at 02:26:11PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/5/26 9:45 AM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
> 
> ... 
> > Here's the scenario:
> > 
> > There is just one CPU left in a domain. The mbm_over worker is woken on
> > that CPU just as the user requests to take that CPU offline (the offline
> > request executing on another CPU).
> > 
> > There is a race: mbm_handle_overflow() has begun execution, but the
> > offline process has taken cpus_write_lock(), so the worker blocks,
> > sleeping in cpus_read_lock().
> > 
> > The offline process calls:
> > 
> > resctrl_arch_offline_cpu()
> > -> resctrl_offline_cpu
> >     -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
> 
> Here I expect d->mbm_work_cpu to be set to this one CPU that is left in the domain.
> 
> >     -> mbm_setup_overflow_handler(d, 0, cpu);
> >        Finds there are other CPUs in the domain
> 
> I do not think that this will find other CPUs in the domain here since
> the scenario starts with there being only one CPU left in the domain and it is
> currently running the overflow handler. From what I can tell, in this scenario,
> mbm_setup_overflow_handler() will return without scheduling the overflow handler
> on any CPU.
> 
> After mbm_setup_overflow_handler() returns I expect d->mbm_work_cpu to be set to 
> nr_cpu_ids.
> 
> This detail does not impact the race you are highlighting though.
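> 
> (For reference, a sketch of why, paraphrased from mbm_setup_overflow_handler()
> with details elided:)
> 
> 	cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
> 	dom->mbm_work_cpu = cpu;	/* >= nr_cpu_ids when no CPU qualifies */
> 	if (cpu < nr_cpu_ids)
> 		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);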
> 
> > -> domain_remove_cpu()
> >    -> domain_remove_cpu_mon()
> >       clears bit for this CPU from hdr->cpu_mask and finds mask is empty
> >       -> resctrl_offline_mon_domain()
> >          -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
> >          -> domain_destroy_l3_mon_state()
> >             -> kfree(d->mbm_states[idx]); /* file system state */
> >       -> removes domain from L3 resource domain list
> >       -> l3_mon_domain_free()
> >          kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
> >          kfree(hw_dom)
> > 
> > Offline process continues. One of the things it does is check for
> > orphans on the run queue of the now-deceased CPU and adopt them.
> > 
> > Finally the offline process completes and does cpus_write_unlock()
> > 
> > Now mbm_handle_overflow() can continue. But it is on the wrong CPU
> > and has a pointer to a work_struct that was freed by the offline
> > process.
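> > 
> > As a rough timeline (a sketch, exact call sites elided):
> > 
> > 	worker CPU				CPU doing the offline
> > 	----------				---------------------
> > 	mbm_handle_overflow() starts
> > 						cpus_write_lock()
> > 	cpus_read_lock() blocks
> > 						resctrl_offline_cpu()
> > 						domain_remove_cpu() frees the
> > 						  domain and its work_struct
> > 						worker adopted by another CPU
> > 						cpus_write_unlock()
> > 	cpus_read_lock() returns: the
> > 	work pointer is use-after-free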
> 
> Thank you for the explanation, I did not consider this scenario. From what I understand
> folks encountering this scenario should also see a splat from the MBM overflow
> handler from all the places where it runs smp_processor_id(). Well, actually, only
> if these people are running with CONFIG_DEBUG_PREEMPT enabled, because of the
> debug_smp_processor_id()->check_preemption_disabled()->is_percpu_thread() check.
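> 
> (Roughly, check_preemption_disabled() in lib/smp_processor_id.c does the
> following, paraphrased rather than verbatim:)
> 
> 	if (likely(preempt_count()))
> 		goto out;		/* no splat */
> 	if (irqs_disabled())
> 		goto out;
> 	if (is_percpu_thread())
> 		goto out;		/* worker still bound to one CPU */
> 	/* otherwise: "BUG: using smp_processor_id() in preemptible" splat */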
> 
> ...
> 
> >> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> >> Here is the idea more specifically for the MBM overflow handler:
> >>
> >> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >> index 88f1fa0b9d8d..7186d6d02d6e 100644
> >> --- a/fs/resctrl/monitor.c
> >> +++ b/fs/resctrl/monitor.c
> >> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
> >>  		goto out_unlock;
> >>  
> >>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> >> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> >> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
> >> +	if (!d)
> >> +		goto out_unlock;
> > 
> > This will get a domain, but it will be for the wrong CPU.
> 
> If it was migrated, yes. From what I understand, when this happens it stops being
> a "per-CPU thread", so how about something like:
> 
> 	mbm_handle_overflow()
> 	{
> 
> 		...
> 		cpus_read_lock();
> 		mutex_lock(&rdtgroup_mutex);
> 		
> 		/*
> 		 * Overflow handler migrated during race while CPU went offline and
> 		 * no longer running on intended CPU.
> 		 */
> 		if (!is_percpu_thread())
> 			goto out_unlock;

This is neat. It handles the above race, where the last CPU in a domain
goes offline. But I think there are problems if other CPUs remain in the
domain and the user takes d->mbm_work_cpu or d->cqm_work_cpu offline.

Here the corner-case semantics of repeated calls to schedule_delayed_work_on()
matter. In this case resctrl_offline_cpu() will call:

	cancel_delayed_work(&d->mbm_over); // Not sync, so running WQ keeps going
	mbm_setup_overflow_handler(d, 0, cpu);
		There are other CPUs available in the domain. Picks one:
		dom->mbm_work_cpu = cpu;
		schedule_delayed_work_on(cpu, &dom->mbm_over, delay);

			This sees that the work is already running and returns
			%false without doing anything.

When the CPU offline completes, the worker runs, finds that
is_percpu_thread() is no longer true, and returns without scheduling
future execution. So checking for overflow on this domain is disabled.
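
To make the hole concrete, here is a sketch of the tail of
mbm_setup_overflow_handler() with the dropped request made visible
(illustration only, reusing the names from the code quoted below, not a
proposed fix):

	dom->mbm_work_cpu = cpu;
	/*
	 * %false means the schedule request was dropped (the corner case
	 * above). Nothing is queued for the new CPU, and with the
	 * is_percpu_thread() check in the worker nothing will ever
	 * requeue this domain.
	 */
	if (!schedule_delayed_work_on(cpu, &dom->mbm_over, delay))
		pr_warn_once("resctrl: MBM overflow worker not rescheduled\n");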

> 		...
> 	out_unlock:
> 		mutex_unlock(&rdtgroup_mutex);
> 		cpus_read_unlock();
> 	}
> 
> 
> I am not clear on whether there is more than one race here now. From the flow
> you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
> it can assume that its associated domain has not been removed and thus that it is
> running with a valid work_struct? More specifically, if is_percpu_thread() is true
> then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work? 
> I am trying to match this race involving the CPU hotplug lock with the race described
> in original changelog that involves rdtgroup_mutex here ...
> 
> > 
> > Maybe it doesn't hurt for it to perform an extra set of mbm_update()
> > calls on that CPU?
> 
> The extra mbm_update() calls seem ok. An extra call to the limbo handler should be ok also.
> One possible issue is the impact on the software controller, which assumes it is being
> called once per second. Looks like an extra call can be avoided though?
> 
> > 
> > It will then do:
> > 
> > 	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
> > 						   RESCTRL_PICK_ANY_CPU);
> > 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
> > 
> > This "wrong" domain already has a worker ... Will this just reset the
> > timeout to the new "delay" value? Possibly also to a different CPU?
> 
> queue_delayed_work_on()'s comments mention "Return: %false if @work was
> already on a queue", which I interpret as the existing (correct) worker not being
> impacted. I am not familiar with these corner cases though.

Yes. The existing worker is not impacted, which causes the problem described above.
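
(For reference, the semantic contrast at the root of this, paraphrased
from the workqueue kernel-doc; a sketch, not verbatim:)

	/*
	 * cancel_delayed_work(dw):      cancel dw if it is pending; do
	 *                               NOT wait for a callback that has
	 *                               already started executing.
	 * cancel_delayed_work_sync(dw): also wait for a running callback
	 *                               to finish, so freeing the object
	 *                               containing dw afterwards is safe.
	 */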

> Reinette

-Tony
