public inbox for linux-kernel@vger.kernel.org
From: "Luck, Tony" <tony.luck@intel.com>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: Borislav Petkov <bp@alien8.de>, <x86@kernel.org>,
	Fenghua Yu <fenghuay@nvidia.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	"Drew Fustini" <dfustini@baylibre.com>,
	Dave Martin <Dave.Martin@arm.com>, Chen Yu <yu.c.chen@intel.com>,
	<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
Date: Tue, 5 May 2026 09:45:38 -0700	[thread overview]
Message-ID: <afoesuWB8RezVLrN@agluck-desk3> (raw)
In-Reply-To: <3f13c7e4-3812-447d-8c42-b28fd6b9d0fa@intel.com>

On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 5/4/26 3:50 PM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
> >> On 5/1/26 2:36 PM, Tony Luck wrote:
> 
> >>> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >>> index 9fd901c78dc6..e68eec83306e 100644
> >>> --- a/fs/resctrl/monitor.c
> >>> +++ b/fs/resctrl/monitor.c
> >>> @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
> >>>  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> >>>  	struct rdt_l3_mon_domain *d;
> >>>  
> >>> -	cpus_read_lock();
> >>>  	mutex_lock(&rdtgroup_mutex);
> >>>  
> >>>  	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
> >>
> >> Since work always runs on a CPU belonging to the domain, could it be simpler to use
> >> get_mon_domain_from_cpu() using the CPU running the work to obtain the domain here
> >> instead of the work contained in the domain struct? 
> > 
> > Is this true? When a CPU is taken offline Linux picks another CPU to run
> 
> These are workers supporting monitoring and they read RMIDs from where they are 
> running. They do not need to IPI a CPU in another domain to read the monitoring data
> of interest. Are you seeing this behave differently?
> 
> Theoretically resctrl could have workers run anywhere if the events being handled
> don't care which CPU the monitoring data is read from, but the current workers
> do not behave this way.
> 
> > any unexpired queued work. No guarantee that new CPU is in the same
> > domain. I think a robust solution is going to need a check that the
> 
> There are a couple of places where the work is scheduled. I understand the
> particular scenario you refer to is the work done by resctrl_offline_cpu().
> Specifically,
> 
> 	resctrl_offline_cpu(unsigned int cpu /* CPU being offlined */)
> 	{
> 		struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> 		struct rdt_l3_mon_domain *d;
> 		...
> 		
> 		d = get_mon_domain_from_cpu(cpu, l3); /* d is domain to which CPU being offlined belongs */
> 		if (d) {
> 			if (/* overflow handler currently running on CPU being offlined */) {
> 				cancel_delayed_work(&d->mbm_over);
> 				mbm_setup_overflow_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
> 			}
> 			if (/* limbo handler currently running on CPU being offlined */) {
> 				cancel_delayed_work(&d->cqm_limbo);
> 				cqm_setup_limbo_handler(d /* domain to pick new CPU from */, 0, cpu /* CPU to exclude */);
> 			}
> 		}
> 	}
> 
> From above I see that the new CPU being picked for the work *is* guaranteed to be in the
> same domain as the CPU being offlined. What am I missing?

Here's the scenario:

There is just one CPU left in a domain. The mbm_over worker is woken on
that CPU just as user requests to take that CPU offline (executing on
another CPU).

There is a race.  mbm_handle_overflow() has begun execution, but the
offline process has taken cpus_write_lock() so it blocks and sleeps
waiting for cpus_read_lock().

The offline process calls:

resctrl_arch_offline_cpu()
-> resctrl_offline_cpu
    -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
    -> mbm_setup_overflow_handler(d, 0, cpu);
       Finds there are other CPUs in the domain
-> domain_remove_cpu()
   -> domain_remove_cpu_mon()
      clears bit for this CPU from hdr->cpu_mask and finds mask is empty
      -> resctrl_offline_mon_domain()
         -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
         -> domain_destroy_l3_mon_state()
            -> kfree(d->mbm_states[idx]); /* file system state */
      -> removes domain from L3 resource domain list
      -> l3_mon_domain_free()
         kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
         kfree(hw_dom)

The offline process continues. One of the things it does is check for
orphaned work items on the run queue of the now-dead CPU and adopt them.

Finally, the offline process completes and calls cpus_write_unlock().

Now mbm_handle_overflow() can continue. But it is on the wrong CPU
and has a pointer to a work_struct that was freed by the offline
process.

> > delayed work handlers are on the right domain. It looks like existing
> > code doesn't handle this well.
> 
> The work is also scheduled from resctrl_online_mon_domain() and then re-scheduled from
> the workers themselves, cqm_handle_limbo() and mbm_handle_overflow(). In all cases the
> workers stay in their respective domains.
> 
> Could you please elaborate how you find that the existing code does not handle this well?
> >> This seems to more closely match the pattern used in rdtgroup_mondata_show() that
> >> stores the domain ID in its state instead of a pointer to the domain and then uses
> >> resctrl_find_domain() to find domain.
> >>
> >>>  
> >>> +	/*  If this domain is being deleted this work no longer needs to run. */
> >>> +	if (d->offlining)
> >>> +		goto out_unlock;
> >>> +
> > 
> > Claude seemed quite confident about removal of cpus_read_lock() in these
> > functions. Sashiko is confident that this has opened up several new race
> > conditions:
> 
> Domains are stored in an RCU list so if cpus_read_lock() is not possible they can also be
> accessed from an RCU read-side critical section, for which x86 does not have an example at
> this time. This would look something like:
> 	rcu_read_lock();
> 	list_for_each_entry_rcu() {
> 	}
> 	rcu_read_unlock();
> 
> The definition of domain_list_lock within arch/x86/kernel/cpu/resctrl/core.c has a nice
> writeup created by James about the locking requirements.
> 
> I do not think cpus_read_lock() should be removed. Especially not since my suggestion is
> to actually traverse the domain list using get_mon_domain_from_cpu() that will complain
> loudly if the CPU hotplug lock is not held.

Agreed. This was a bad choice by Claude, and I fell for its reasoning.

> > 
> > https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com
> > 
> > Maybe we need to add a reference count to the rdt_l3_mon_domain
> > structure and delay freeing it until the last user is gone?
> 
> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> Here is the idea more specifically for the MBM overflow handler:
> 
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 88f1fa0b9d8d..7186d6d02d6e 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
>  		goto out_unlock;
>  
>  	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> -	d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> +	d = get_mon_domain_from_cpu(smp_processor_id(), r);
> +	if (!d)
> +		goto out_unlock;

This will get a domain, but it will be for the wrong CPU.

Maybe it doesn't hurt for it to perform an extra set of mbm_update()
calls on that CPU?

It will then do:

	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
						   RESCTRL_PICK_ANY_CPU);
	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);

This "wrong" domain already has a worker ... Will this just reset the
timeout to the new "delay" value? Possibly also to a different CPU?

>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
>  		mbm_update(r, d, prgrp);
> 
> update_mba_bw() is run from mbm_handle_overflow() and uses this same pattern
> to get the MBA control domain, which I find supports that this context is safe.
> 
> Reinette

-Tony


Thread overview: 5+ messages
2026-05-01 21:36 [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain() Tony Luck
2026-05-04 15:11 ` Reinette Chatre
2026-05-04 22:50   ` Luck, Tony
2026-05-05  4:39     ` Reinette Chatre
2026-05-05 16:45       ` Luck, Tony [this message]
