From: "Luck, Tony" <tony.luck@intel.com>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: Borislav Petkov <bp@alien8.de>, <x86@kernel.org>,
Fenghua Yu <fenghuay@nvidia.com>,
Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
Peter Newman <peternewman@google.com>,
James Morse <james.morse@arm.com>,
Babu Moger <babu.moger@amd.com>,
"Drew Fustini" <dfustini@baylibre.com>,
Dave Martin <Dave.Martin@arm.com>, Chen Yu <yu.c.chen@intel.com>,
<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
Date: Tue, 5 May 2026 16:07:38 -0700
Message-ID: <afp4OkwtPpyIwAC1@agluck-desk3>
In-Reply-To: <7fad1d7d-c892-416e-b97a-a230fd43f2a4@intel.com>
On Tue, May 05, 2026 at 02:26:11PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/5/26 9:45 AM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
>
> ...
> > Here's the scenario:
> >
> > There is just one CPU left in a domain. The mbm_over worker is woken on
> > that CPU just as user requests to take that CPU offline (executing on
> > another CPU).
> >
> > There is a race. mbm_handle_overflow() has begun execution, but the
> > offline process has taken cpus_write_lock() so it blocks and sleeps
> > waiting for cpus_read_lock().
> >
> > The offline process calls:
> >
> > resctrl_arch_offline_cpu()
> > -> resctrl_offline_cpu()
> > -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
>
> Here I expect d->mbm_work_cpu to be set to this one CPU that is left in the domain.
>
> > -> mbm_setup_overflow_handler(d, 0, cpu);
> > Finds there are other CPUs in the domain
>
> I do not think that this will find other CPUs in the domain here since
> the scenario starts with there being only one CPU left in the domain and it is
> currently running the overflow handler. From what I can tell, in this scenario,
> mbm_setup_overflow_handler() will return without scheduling the overflow handler
> on any CPU.
>
> After mbm_setup_overflow_handler() returns I expect d->mbm_work_cpu to be set to
> nr_cpu_ids.
>
> This detail does not impact the race you are highlighting though.
>
> > -> domain_remove_cpu()
> > -> domain_remove_cpu_mon()
> > clears bit for this CPU from hdr->cpu_mask and finds mask is empty
> > -> resctrl_offline_mon_domain()
> > -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
> > -> domain_destroy_l3_mon_state()
> > -> kfree(d->mbm_states[idx]); /* file system state */
> > -> removes domain from L3 resource domain list
> > -> l3_mon_domain_free()
> > kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
> > kfree(hw_dom)
> >
> > Offline process continues. One of the things it does is checks for
> > orphans on the run queue of the now deceased CPU and adopts them.
> >
> > Finally the offline process completes and does cpus_write_unlock()
> >
> > Now mbm_handle_overflow() can continue. But it is on the wrong CPU
> > and has a pointer to a work_struct that was freed by the offline
> > process.
>
> Thank you for the explanation, I did not consider this scenario. From what I understand,
> folks encountering this scenario should also encounter a splat from the MBM overflow
> handler from all the places where it runs smp_processor_id(). Well, actually, only if
> they are running with CONFIG_DEBUG_PREEMPT enabled, because of the
> debug_smp_processor_id()->check_preemption_disabled()->is_percpu_thread() check.
>
> ...
>
> >> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> >> Here is the idea more specifically for the MBM overflow handler:
> >>
> >> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >> index 88f1fa0b9d8d..7186d6d02d6e 100644
> >> --- a/fs/resctrl/monitor.c
> >> +++ b/fs/resctrl/monitor.c
> >> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
> >> goto out_unlock;
> >>
> >> r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> >> - d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> >> + d = get_mon_domain_from_cpu(smp_processor_id(), r);
> >> + if (!d)
> >> + goto out_unlock;
> >
> > This will get a domain, but it will be for the wrong CPU.
>
> If it was migrated, yes. From what I understand, when this happens it stops being
> a "per-CPU thread", so how about something like:
>
> mbm_handle_overflow()
> {
>
> ...
> cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> /*
> * Overflow handler migrated during race while CPU went offline and
> * no longer running on intended CPU.
> */
> if (!is_percpu_thread())
> goto unlock;
This is neat. It works well in the case of the above race where the last
CPU in a domain goes offline. But I think there are problems if there
are other CPUs available and the user takes d->mbm_work_cpu or
d->cqm_work_cpu offline.
Here the corner-case semantics of repeated calls to schedule_delayed_work_on()
are important. In this case resctrl_offline_cpu() will call:

	cancel_delayed_work(&d->mbm_over); /* Not _sync, so a running worker keeps going */
	mbm_setup_overflow_handler(d, 0, cpu);

mbm_setup_overflow_handler() finds other CPUs available in the domain and
picks one:

	dom->mbm_work_cpu = cpu;
	schedule_delayed_work_on(cpu, &dom->mbm_over, delay);

schedule_delayed_work_on() sees that the work is already running and
returns %false without doing anything.

When offlining of the CPU completes, the worker runs, finds it is no
longer an is_percpu_thread() and returns without scheduling future
execution. So checking for overflow on this domain is disabled.
> ...
> out_unlock:
> mutex_unlock(&rdtgroup_mutex);
> cpus_read_unlock();
> }
>
>
> I am not clear on whether there are more than one race here now. From the flow
> you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
> it can assume that its associated domain has not been removed and thus that it is
> running with a valid work_struct? More specifically, if is_percpu_thread() is true
> then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work?
> I am trying to match this race involving the CPU hotplug lock with the race described
> in original changelog that involves rdtgroup_mutex here ...
>
> >
> > Maybe it doesn't hurt for it to perform an extra set of mbm_update()
> > calls on that CPU?
>
> The extra mbm_update() calls seem ok. An extra call to the limbo handler should be ok also.
> One possible issue is the impact on software controller that assumes it is being
> called once per second. Looks like an extra call can be avoided though?
>
> >
> > It will then do:
> >
> > d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
> > RESCTRL_PICK_ANY_CPU);
> > schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
> >
> > This "wrong" domain already has a worker ... Will this just reset the
> > timeout to the new "delay" value? Possibly also to a different CPU?
>
> queue_delayed_work_on()'s comments mention "Return: %false if @work was
> already on a queue", which I interpret as the existing (correct) worker not
> being impacted. I am not familiar with these corner cases though.
Yes. The existing worker is not impacted, which causes the problem described above.
> Reinette
-Tony
2026-05-01 21:36 [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain() Tony Luck
2026-05-04 15:11 ` Reinette Chatre
2026-05-04 22:50 ` Luck, Tony
2026-05-05 4:39 ` Reinette Chatre
2026-05-05 16:45 ` Luck, Tony
2026-05-05 21:26 ` Reinette Chatre
2026-05-05 23:07 ` Luck, Tony [this message]