Linux Perf Users
 help / color / mirror / Atom feed
From: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
To: "Chen, Zide" <zide.chen@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Namhyung Kim <namhyung@kernel.org>,
	Ian Rogers <irogers@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Andi Kleen <ak@linux.intel.com>,
	Eranian Stephane <eranian@google.com>
Cc: linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org
Subject: Re: [PATCH V2 6/8] perf/x86/intel/uncore: Introduce PMU flags and broken state
Date: Thu, 4 Jun 2026 09:15:31 +0800	[thread overview]
Message-ID: <c0bb7cab-8921-4c25-8588-fb4b8261c866@linux.intel.com> (raw)
In-Reply-To: <194d0173-dd05-4392-9dfa-26175bc71fb4@intel.com>


On 6/3/2026 11:46 PM, Chen, Zide wrote:
>
> On 6/2/2026 9:13 PM, Mi, Dapeng wrote:
>> On 6/2/2026 1:01 AM, Zide Chen wrote:
>>> Replace the boolean 'registered' field in intel_uncore_pmu with an
>>> unsigned long 'flags' field, and add a PMU_BROKEN flag to track box
>>> setup failures. Broken flag is sticky, means it won't be cleared
>>> unless module reload or system reboot.
>>>
>>> Broken PMUs are skipped in the CPU hotplug and box allocation paths.
>>>
>>> When any box fails to initialize, the PMU is marked broken.  Broken
>>> PMUs reject new event assignments and skip future box setup attempts.
>>> If the PMU was already registered, it remains so to avoid disrupting
>>> in-flight events on other boxes.
>>>
>>> Signed-off-by: Zide Chen <zide.chen@intel.com>
>>> ---
>>> v2:
>>> - Make the broken flag sticky by clear_bit() in uncore_pmu_unregister()
>>> other than zeroing out pmu->flag.
>>> - In uncore_change_type_ctx(), don't stop CPU migration on broken PMU
>>>   for in-flight events.  (Sashiko).
>>> - Use box->cpu == -1 to identify inactive boxes that don't need
>>>   migration, no need to check uncore_die_has_box(), which is incomplete.
>>> ---
>>>  arch/x86/events/intel/uncore.c     | 44 ++++++++++++++++++++++--------
>>>  arch/x86/events/intel/uncore.h     | 13 ++++++++-
>>>  arch/x86/events/intel/uncore_snb.c |  2 +-
>>>  3 files changed, 46 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
>>> index 283e41933ba7..f2cb3fde2dda 100644
>>> --- a/arch/x86/events/intel/uncore.c
>>> +++ b/arch/x86/events/intel/uncore.c
>>> @@ -757,7 +757,7 @@ static int uncore_pmu_event_init(struct perf_event *event)
>>>  
>>>  	pmu = uncore_event_to_pmu(event);
>>>  	/* no device found for this pmu */
>>> -	if (!pmu->registered)
>>> +	if (!uncore_pmu_available(pmu))
>>>  		return -ENOENT;
>>>  
>>>  	/* Sampling not supported yet */
>>> @@ -953,16 +953,18 @@ static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
>>>  
>>>  	ret = perf_pmu_register(&pmu->pmu, pmu->name, -1);
>>>  	if (!ret)
>>> -		pmu->registered = true;
>>> +		uncore_pmu_set_registered(pmu);
>>>  	return ret;
>>>  }
>>>  
>>>  static void uncore_pmu_unregister(struct intel_uncore_pmu *pmu)
>>>  {
>>> -	if (!pmu->registered)
>>> +	if (!uncore_pmu_registered(pmu))
>>>  		return;
>>>  	perf_pmu_unregister(&pmu->pmu);
>>> -	pmu->registered = false;
>>> +
>>> +	/* Keep PMU_BROKEN_BIT stick. */
>>> +	uncore_pmu_clear_registered(pmu);
>> Why to set PMU_BROKEN_BIT to a stick bit and doesn't clear it when
>> unregistering the PMU? Per my understanding, there is no active boxes when
>> uncore_pmu_unregister() is called, seem all flags can be cleared. When the
>> new CPU or PCI device are hot plug-in again, it may be a new device or the
>> old issue which leads to the PMU can't be initialized has been fixed, the
>> uncore PMU can work correctly, the PMU_BROKEN_BIT should be set any more.
> Woops, a typo here: stick->sticky.
>
> Let's examine a theoretical flow for a particular PMU:
>
> PCI dev on die0 online: activeboxes=1, PMU registered
> PCI dev on die1 online: init_box failed, pmu->activeboxes=1, set broken
> PCI dev on die0 offline: activeboxes=0, PMU unregistered
>
> If the broken bit is cleared in uncore_pmu_unregister():
> PCI dev on die0 or die2 online: activeboxes=1, PMU registered

I see. 


>
> This is wrong, because the box for die1 is not working.
>
> In reality, box_allocate() or init_box() failures may not happen at all,
> and when they do, the system may be experiencing severe out-of-memory or
> other catastrophic conditions. Keeping the broken flag sticky should be
> fine.
>
> One alternative to avoid making it sticky is to use pmu->activeboxes as
> pmu->die_refcnt to track online dies, not necessarily active boxes, as
> v1 does.
>
> The problem with this approach is that it could cause a pmu->die_refcnt
> leak in the uncore_pci_remove() path: if init_box() fails but
> pmu->die_refcnt has already been incremented, there is no chance to call
> uncore_pci_pmu_unregister() because pci_get_drvdata() returns NULL.
>
>>>  }
>>>  
>>>  static void uncore_free_boxes(struct intel_uncore_pmu *pmu)
>>> @@ -1153,7 +1155,12 @@ static int uncore_box_setup(struct intel_uncore_pmu *pmu,
>>>  {
>>>  	int ret;
>>>  
>>> -	uncore_box_init(box);
>>> +	if (uncore_pmu_broken(pmu))
>>> +		return -ENODEV;
>>> +
>>> +	ret = uncore_box_init(box);
>>> +	if (ret)
>>> +		goto err;
>>>  
>>>  	/* First active box registers the pmu. */
>>>  	if (atomic_inc_return(&pmu->activeboxes) > 1)
>>> @@ -1167,6 +1174,16 @@ static int uncore_box_setup(struct intel_uncore_pmu *pmu,
>>>  
>>>  	return 0;
>>>  err:
>>> +	/*
>>> +	 * On failure on any box, mark the per-package PMU as broken regardless
>>> +	 * of whether it was registered or not.
>>> +	 *
>>> +	 * Don't decrement refcnt to avoid other in-die CPUs from trying to set
>>> +	 * up the PMU box again.
>>> +	 *
>>> +	 * Don't kfree box; MSR and MMIO boxes are freed at module exit only.
>>> +	 */
>>> +	uncore_pmu_set_broken(pmu);
>>>  	uncore_box_exit(box);
>>>  	return ret;
>>>  }
>>> @@ -1190,8 +1207,10 @@ static int uncore_pci_pmu_register(struct pci_dev *pdev,
>>>  		return -EINVAL;
>>>  
>>>  	box = uncore_alloc_box(type, NUMA_NO_NODE);
>>> -	if (!box)
>>> +	if (!box) {
>>> +		uncore_pmu_set_broken(pmu);
>>>  		return -ENOMEM;
>>> +	}
>>>  
>>>  	box->dieid = die;
>>>  	box->pci_dev = pdev;
>>> @@ -1504,7 +1523,8 @@ static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu,
>>>  
>>>  		if (old_cpu < 0) {
>>>  			WARN_ON_ONCE(box->cpu != -1);
>>> -			if (uncore_die_has_box(type, die, pmu->pmu_idx)) {
>>> +			if (uncore_die_has_box(type, die, pmu->pmu_idx) &&
>>> +			    !uncore_pmu_broken(pmu)) {
>>>  				box->cpu = new_cpu;
>>>  				cpumask_set_cpu(new_cpu, &pmu->cpu_mask);
>>>  			}
>>> @@ -1512,12 +1532,14 @@ static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu,
>>>  		}
>>>  
>>>  		WARN_ON_ONCE(box->cpu != -1 && box->cpu != old_cpu);
>>> -		box->cpu = -1;
>>>  		cpumask_clear_cpu(old_cpu, &pmu->cpu_mask);
>>> -		if (new_cpu < 0)
>>> +		if (new_cpu < 0) {
>>> +			box->cpu = -1;
>>>  			continue;
>>> +		}
>>>  
>>> -		if (!uncore_die_has_box(type, die, pmu->pmu_idx))
>>> +		/* non-active box doesn't need migration. */
>>> +		if (box->cpu == -1)
>>>  			continue;
>>>  		uncore_pmu_cancel_hrtimer(box);
>>>  		perf_pmu_migrate_context(&pmu->pmu, old_cpu, new_cpu);
>>> @@ -1593,7 +1615,7 @@ static int allocate_boxes(struct intel_uncore_type **types,
>>>  		type = *types;
>>>  		pmu = type->pmus;
>>>  		for (i = 0; i < type->num_boxes; i++, pmu++) {
>> IIUC, the "type->num_boxes" actually indicates the number of PMU  or the
>> number of box classes. Since currently there could be multiple boxes for
>> each kind of PMUs on the multiple dies device, the name "num_boxes" become
>> inaccurate and much misleading. could we change it to a more accurate name,
>> like "num_pmus" or "num_box_classes"?
> Yes, agree. Will do more clean up in the future.

Thanks.


>>> -			if (pmu->boxes[die])
>>> +			if (pmu->boxes[die] || uncore_pmu_broken(pmu))
>>>  				continue;
>>>  			box = uncore_alloc_box(type, cpu_to_node(cpu));
>>>  			if (!box)
>>> diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
>>> index d732b87be0a9..0adb477d9708 100644
>>> --- a/arch/x86/events/intel/uncore.h
>>> +++ b/arch/x86/events/intel/uncore.h
>>> @@ -146,13 +146,24 @@ struct intel_uncore_pmu {
>>>  	struct pmu			pmu;
>>>  	char				name[UNCORE_PMU_NAME_LEN];
>>>  	int				pmu_idx;
>>> -	bool				registered;
>>> +	unsigned long			flags;
>>>  	atomic_t			activeboxes;
>>>  	cpumask_t			cpu_mask;
>>>  	struct intel_uncore_type	*type;
>>>  	struct intel_uncore_box		**boxes;
>>>  };
>>>  
>>> +#define PMU_REGISTERED_BIT	0
>>> +#define PMU_BROKEN_BIT		1
>>> +
>>> +#define uncore_pmu_registered(pmu)	test_bit(PMU_REGISTERED_BIT, &(pmu)->flags)
>>> +#define uncore_pmu_broken(pmu)		test_bit(PMU_BROKEN_BIT, &(pmu)->flags)
>>> +#define uncore_pmu_available(pmu)	(uncore_pmu_registered(pmu) &&  \
>>> +					 !uncore_pmu_broken(pmu))
>>> +#define uncore_pmu_set_registered(pmu)	set_bit(PMU_REGISTERED_BIT, &(pmu)->flags)
>>> +#define uncore_pmu_set_broken(pmu)	set_bit(PMU_BROKEN_BIT, &(pmu)->flags)
>>> +#define uncore_pmu_clear_registered(pmu) clear_bit(PMU_REGISTERED_BIT, &(pmu)->flags)
>>> +
>>>  struct intel_uncore_extra_reg {
>>>  	raw_spinlock_t lock;
>>>  	u64 config, config1, config2;
>>> diff --git a/arch/x86/events/intel/uncore_snb.c b/arch/x86/events/intel/uncore_snb.c
>>> index c5347920541c..055131c508ff 100644
>>> --- a/arch/x86/events/intel/uncore_snb.c
>>> +++ b/arch/x86/events/intel/uncore_snb.c
>>> @@ -940,7 +940,7 @@ static int snb_uncore_imc_event_init(struct perf_event *event)
>>>  
>>>  	pmu = uncore_event_to_pmu(event);
>>>  	/* no device found for this pmu */
>>> -	if (!pmu->registered)
>>> +	if (!uncore_pmu_available(pmu))
>>>  		return -ENOENT;
>>>  
>>>  	/* Sampling not supported yet */

  reply	other threads:[~2026-06-04  1:15 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-01 17:01 [PATCH v2 0/8] perf/x86/intel/uncore: PMU setup robustness fixes Zide Chen
2026-06-01 17:01 ` [PATCH V2 1/8] perf/x86/intel/uncore: Fix PCI PMU cleanup on setup failure Zide Chen
2026-06-02  7:24   ` Mi, Dapeng
2026-06-01 17:01 ` [PATCH V2 2/8] perf/x86/intel/uncore: Fix refcnt and other cleanups Zide Chen
2026-06-02  9:52   ` Mi, Dapeng
2026-06-02 14:16     ` Chen, Zide
2026-06-03  1:13       ` Mi, Dapeng
2026-06-03 15:09         ` Chen, Zide
2026-06-04  1:00           ` Mi, Dapeng
2026-06-04 15:41             ` Chen, Zide
2026-06-05  0:30               ` Mi, Dapeng
2026-06-01 17:01 ` [PATCH V2 3/8] perf/x86/intel/uncore: Let init_box() callback report failures Zide Chen
2026-06-02  9:57   ` Mi, Dapeng
2026-06-01 17:01 ` [PATCH V2 4/8] perf/x86/intel/uncore: Keep PCI PMUs working when MMIO/MSR setup fails Zide Chen
2026-06-03  1:24   ` Mi, Dapeng
2026-06-01 17:01 ` [PATCH V2 5/8] perf/x86/intel/uncore: Factor out box setup code Zide Chen
2026-06-03  1:30   ` Mi, Dapeng
2026-06-01 17:01 ` [PATCH V2 6/8] perf/x86/intel/uncore: Introduce PMU flags and broken state Zide Chen
2026-06-03  2:13   ` Mi, Dapeng
2026-06-03 15:46     ` Chen, Zide
2026-06-04  1:15       ` Mi, Dapeng [this message]
2026-06-01 17:01 ` [PATCH V2 7/8] perf/x86/intel/uncore: Fix uncore_box ref/unref ordering on CPU hotplug Zide Chen
2026-06-03  2:32   ` Mi, Dapeng
2026-06-03 16:40     ` Chen, Zide
2026-06-04  1:16       ` Mi, Dapeng
2026-06-01 17:01 ` [PATCH V2 8/8] perf/x86/intel/uncore: Implement lazy setup for MSR/MMIO PMU Zide Chen
2026-06-03  2:43   ` Mi, Dapeng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c0bb7cab-8921-4c25-8588-fb4b8261c866@linux.intel.com \
    --to=dapeng1.mi@linux.intel.com \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=ak@linux.intel.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=eranian@google.com \
    --cc=irogers@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=zide.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox