Re: [PATCH 5/7] perf/x86/intel/uncore: Introduce PMU flags and broken state

Linux Perf Users
 help / color / mirror / Atom feed

From: "Chen, Zide" <zide.chen@intel.com>
To: sashiko-reviews@lists.linux.dev
Cc: linux-perf-users@vger.kernel.org
Subject: Re: [PATCH 5/7] perf/x86/intel/uncore: Introduce PMU flags and broken state
Date: Fri, 22 May 2026 14:24:05 -0500	[thread overview]
Message-ID: <abd37e83-82f3-4d39-a3df-3ce76406fdbe@intel.com> (raw)
In-Reply-To: <20260514042738.80B40C2BCB7@smtp.kernel.org>



On 5/13/2026 11:27 PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
> - [Critical] Intentionally leaking `die_refcnt` on PCI box setup failure prevents PMU unregistration, leading to a Use-After-Free.
> - [High] Global `PMU_BROKEN` state disables new event creation and breaks CPU context migration for fully functional dies.
> - [Medium] Confused lifecycle assumptions lead to contradictory error handling and dead code between PCI and MSR/MMIO PMUs.
> - [Medium] Data race between plain store and atomic bitwise operation on `pmu->flags`.
> --
> 
> commit 553c891e098baf8dc3f2b474849cc4642dfe9be6
> Author: Zide Chen <zide.chen@intel.com>
> 
> perf/x86/intel/uncore: Introduce PMU flags and broken state
> 
> This commit replaces the 'registered' field in intel_uncore_pmu with a
> 'flags' field, introducing a PMU_BROKEN flag to track box setup failures.
> If a box fails to initialize, the PMU is marked broken, which rejects new
> event assignments and skips future box setup attempts.
> 
>> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
>> index 617a68c1d76c3..4d31dcaf5405b 100644
>> --- a/arch/x86/events/intel/uncore.c
>> +++ b/arch/x86/events/intel/uncore.c
> 
> [ ... ]
> 
>> @@ -738,7 +738,7 @@ static int uncore_pmu_event_init(struct perf_event *event)
>>  
>>  	pmu = uncore_event_to_pmu(event);
>>  	/* no device found for this pmu */
>> -	if (!pmu->registered)
>> +	if (!uncore_pmu_available(pmu))
>>  		return -ENOENT;
> 
> Since PMU_BROKEN is a global flag across all dies and packages, if one die
> fails initialization, the entire PMU is marked broken. Does this cause
> uncore_pmu_event_init() to incorrectly reject new event assignments on fully
> functional packages?

This is by design. The PMU represents a single logical device spanning
all dies, so if any box fails initialization, the entire PMU is marked
broken to avoid partial and potentially inconsistent monitoring state.
Allowing some dies to remain active while others are broken could lead
to misleading perf data.

> 
> [ ... ]
> 
>> @@ -934,16 +934,16 @@ static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
>>  
>>  	ret = perf_pmu_register(&pmu->pmu, pmu->name, -1);
>>  	if (!ret)
>> -		pmu->registered = true;
>> +		uncore_pmu_set_registered(pmu);
>>  	return ret;
>>  }
>>  
>>  static void uncore_pmu_unregister(struct intel_uncore_pmu *pmu)
>>  {
>> -	if (!pmu->registered)
>> +	if (!uncore_pmu_registered(pmu))
>>  		return;
>>  	perf_pmu_unregister(&pmu->pmu);
>> -	pmu->registered = false;
>> +	WRITE_ONCE(pmu->flags, 0);
> 
> Can this plain store data race with concurrent atomic bitwise operations?
> 
> If uncore_box_setup() executes on another CPU (e.g., probing a new failing
> die while removing a successful die), it uses the atomic read-modify-write
> instruction set_bit(PMU_BROKEN_BIT, &(pmu)->flags). Mixing plain stores and
> atomic bitwise operations on the same memory location could lead to
> clobbered updates.

Actually, I'm thinking the atomic operations are not needed at all. On
the MSR/MMIO side, CPU hotplug callbacks are serialized, so plain bit
operations are safe. For PCI PMUs, the PCI device is per-die, so there
is no way that box_setup() and pmu_unregister() will happen concurrently.

> [ ... ]
> 
>> @@ -1136,7 +1136,13 @@ static int uncore_box_setup(struct intel_uncore_pmu *pmu,
>>  
>>  	/* die_refcnt tracks online dies, not only functioning boxes. */
>>  	dies = atomic_inc_return(&pmu->die_refcnt);
>> -	uncore_box_init(box);
>> +
>> +	if (uncore_pmu_broken(pmu))
>> +		return -ENODEV;
>> +
>> +	ret = uncore_box_init(box);
>> +	if (ret)
>> +		goto err;
>>  
>>  	/* First active box registers the pmu. */
>>  	if (dies > 1)
>> @@ -1148,6 +1154,19 @@ static int uncore_box_setup(struct intel_uncore_pmu *pmu,
>>  
>>  	return 0;
>>  err:
>> +	/*
>> +	 * On failure on any box, mark the per-package PMU as broken regardless
>> +	 * of whether it was registered or not.
>> +	 *
>> +	 * Don't decrement die_refcnt to prevent any future CPU online
>> +	 * event or PCI probe, from retrying the failed PMU registration.
>> +	 *
>> +	 * Don't decrement cpu_refcnt to avoid other in-die CPUs from
>> +	 * trying to set up the PMU box again.
>> +	 *
>> +	 * Don't kfree box; MSR and MMIO boxes are freed at module exit only.
>> +	 */
> 
> Is there a risk of a use-after-free resulting from intentionally leaking
> die_refcnt here?
> 
> For PCI PMUs, the caller uncore_pci_pmu_register() executes kfree(box) on error
> and fails the PCI probe, meaning uncore_pci_remove() is never called to clean
> up this leaked reference.

You are right — in that case, uncore_pci_pmu_unregister() may not be
called if the box is NULL in uncore_pci_remove(). That said, die_refcnt
should be incremented only when pmu->boxes[die] is not NULL, for both
the PCI and MSR/MMIO cases.

In this patch, die_refcnt should be decremented if uncore_box_setup()
fails in the PCI case. There is also a pre-existing bug where die_refcnt
could be unexpectedly decremented via the uncore_bus_notify() path.


> If another successfully probed PCI box for this PMU is later removed,
> uncore_pci_pmu_unregister() decrements die_refcnt, but it will not reach 0 due
> to the earlier leak. Consequently, uncore_pmu_unregister(pmu) is skipped,
> leaving the PMU and its associated perf events active.
> 
> When uncore_pci_pmu_unregister() then frees the successful box, active
> perf_event instances could still hold event->pmu_private pointing to this freed
> box, potentially triggering a use-after-free on subsequent event reads or
> context switches.
> 
> Additionally, this comment claims MSR and MMIO boxes are freed at module exit
> only. Does this conflict with PCI PMUs which unconditionally execute kfree(box)
> on error, immediately destroying the box?

There is no conflicts here.
>> +	uncore_pmu_set_broken(pmu);
>>  	uncore_box_exit(box);
>>  	return ret;
>>  }
> 
> [ ... ]
> 
>> @@ -1483,7 +1502,8 @@ static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu,
>>  
>>  		if (old_cpu < 0) {
>>  			WARN_ON_ONCE(box->cpu != -1);
>> -			if (uncore_die_has_box(type, die, pmu->pmu_idx)) {
>> +			if (uncore_die_has_box(type, die, pmu->pmu_idx) &&
>> +			    !uncore_pmu_broken(pmu)) {
>>  				box->cpu = new_cpu;
>>  				cpumask_set_cpu(new_cpu, &pmu->cpu_mask);
>>  			}
>> @@ -1493,7 +1513,7 @@ static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu,
>>  		WARN_ON_ONCE(box->cpu != -1 && box->cpu != old_cpu);
>>  		box->cpu = -1;
>>  		cpumask_clear_cpu(old_cpu, &pmu->cpu_mask);
>> -		if (new_cpu < 0)
>> +		if (new_cpu < 0 || uncore_pmu_broken(pmu))
>>  			continue;
> 
> If uncore_pmu_broken(pmu) causes this to skip, does this strand active hardware
> events on offlined CPUs for fully functional dies?
> 
> The commit message states the goal is to avoid disrupting in-flight events on
> other boxes, but skipping uncore_pmu_cancel_hrtimer() and
> perf_pmu_migrate_context() here seems to break CPU context migration for those
> events.

You are right, checking the broken flag here could disrupt in-flight
events. Instead, checking box->cpu == -1 is appropriate to skip broken
boxes.

@@ -1522,9 +1522,15 @@ static void uncore_change_type_ctx(struct
intel_uncore_type *type, int old_cpu,
                }

                WARN_ON_ONCE(box->cpu != -1 && box->cpu != old_cpu);

+               if (box->cpu == -1)
+                       continue;
+
                box->cpu = -1;
                cpumask_clear_cpu(old_cpu, &pmu->cpu_mask);
-               if (new_cpu < 0 || uncore_pmu_broken(pmu))
+               if (new_cpu < 0)
                        continue;

>>  
>>  		if (!uncore_die_has_box(type, die, pmu->pmu_idx))
> 
> [ ... ]
> 
>> @@ -1573,7 +1593,7 @@ static int allocate_boxes(struct intel_uncore_type **types,
>>  		type = *types;
>>  		pmu = type->pmus;
>>  		for (i = 0; i < type->num_boxes; i++, pmu++) {
>> -			if (pmu->boxes[die])
>> +			if (pmu->boxes[die] || uncore_pmu_broken(pmu))
>>  				continue;
> 
> Is this check dead code? allocate_boxes() is exclusively used by MSR/MMIO
> PMUs, but those PMUs ignore hardware initialization failures and never call
> uncore_pmu_set_broken(pmu).

Not quite true. MSR/MMIO PMUs do not ignore hardware initialization
failures — they call uncore_pmu_set_broken() in uncore_box_setup() when
setup fails.

next prev parent reply	other threads:[~2026-05-22 19:24 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-12 23:30 [PATCH 0/7] perf/x86/intel/uncore: PMU setup robustness fixes Zide Chen
2026-05-12 23:30 ` [PATCH 1/7] perf/x86/intel/uncore: Rename refcount fields and other cleanups Zide Chen
2026-05-13  0:26   ` Ian Rogers
2026-05-14  0:58   ` sashiko-bot
2026-05-14 19:06     ` Chen, Zide
2026-05-12 23:30 ` [PATCH 2/7] perf/x86/intel/uncore: Let init_box() callback report failures Zide Chen
2026-05-13  0:23   ` Ian Rogers
2026-05-14  2:14   ` sashiko-bot
2026-05-21 20:29     ` Chen, Zide
2026-05-12 23:30 ` [PATCH 3/7] perf/x86/intel/uncore: Keep PCI PMUs working when MMIO/MSR setup fails Zide Chen
2026-05-13  0:30   ` Ian Rogers
2026-05-12 23:30 ` [PATCH 4/7] perf/x86/intel/uncore: Factor out box setup code Zide Chen
2026-05-13  0:27   ` Ian Rogers
2026-05-14  3:34   ` sashiko-bot
2026-05-12 23:30 ` [PATCH 5/7] perf/x86/intel/uncore: Introduce PMU flags and broken state Zide Chen
2026-05-13  0:28   ` Ian Rogers
2026-05-14  4:27   ` sashiko-bot
2026-05-22 19:24     ` Chen, Zide [this message]
2026-05-12 23:30 ` [PATCH 6/7] perf/x86/intel/uncore: Fix uncore_box ref/unref ordering on CPU hotplug Zide Chen
2026-05-13  0:32   ` Ian Rogers
2026-05-13  8:59   ` Mi, Dapeng
2026-05-13 18:43     ` Chen, Zide
2026-05-14  5:12   ` sashiko-bot
2026-05-12 23:30 ` [PATCH 7/7] perf/x86/intel/uncore: Implement lazy setup for MSR/MMIO PMU Zide Chen
2026-05-13  0:34   ` Ian Rogers
2026-05-13  9:03   ` Mi, Dapeng
2026-05-13 16:47     ` Chen, Zide
2026-05-14  5:38   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=abd37e83-82f3-4d39-a3df-3ce76406fdbe@intel.com \
    --to=zide.chen@intel.com \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox