public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable
@ 2025-02-05 10:21 Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 01/24] lockdep: Fix might_fault() Peter Zijlstra
                   ` (24 more replies)
  0 siblings, 25 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Hi

New version of the perf_pmu_unregister() rework; a fair number of bugs
have been squashed. Many thanks to Ravi for writing a stress tester. The
robots also managed to find a few problems.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 01/24] lockdep: Fix might_fault()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-06 18:19   ` David Hildenbrand
  2025-02-05 10:21 ` [PATCH v2 02/24] perf: Ensure bpf_perf_link path is properly serialized Peter Zijlstra
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	David Hildenbrand

Turns out that commit 9ec23531fd48 ("sched/preempt, mm/fault: Trigger
might_sleep() in might_fault() with disabled pagefaults") accidentally
(and unnecessarily) put the lockdep part of __might_fault() under
CONFIG_DEBUG_ATOMIC_SLEEP.

Cc: David Hildenbrand <david@redhat.com>
Fixes: 9ec23531fd48 ("sched/preempt, mm/fault: Trigger might_sleep() in might_fault() with disabled pagefaults")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c |    2 --
 1 file changed, 2 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6695,10 +6695,8 @@ void __might_fault(const char *file, int
 	if (pagefault_disabled())
 		return;
 	__might_sleep(file, line);
-#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
 	if (current->mm)
 		might_lock_read(&current->mm->mmap_lock);
-#endif
 }
 EXPORT_SYMBOL(__might_fault);
 #endif



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 02/24] perf: Ensure bpf_perf_link path is properly serialized
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 01/24] lockdep: Fix might_fault() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 03/24] perf: Simplify child event tear-down Peter Zijlstra
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Ravi reported that the bpf_perf_link_attach() usage of
perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the
PERF_EVENT_IOC_SET_BPF case.

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6001,6 +6001,9 @@ static int perf_event_set_output(struct
 static int perf_event_set_filter(struct perf_event *event, void __user *arg);
 static int perf_copy_attr(struct perf_event_attr __user *uattr,
 			  struct perf_event_attr *attr);
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+				     struct bpf_prog *prog,
+				     u64 bpf_cookie);
 
 static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
 {
@@ -6063,7 +6066,7 @@ static long _perf_ioctl(struct perf_even
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
 
-		err = perf_event_set_bpf_prog(event, prog, 0);
+		err = __perf_event_set_bpf_prog(event, prog, 0);
 		if (err) {
 			bpf_prog_put(prog);
 			return err;
@@ -10790,8 +10793,9 @@ static inline bool perf_event_is_tracing
 	return false;
 }
 
-int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
-			    u64 bpf_cookie)
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+				     struct bpf_prog *prog,
+				     u64 bpf_cookie)
 {
 	bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
 
@@ -10829,6 +10833,20 @@ int perf_event_set_bpf_prog(struct perf_
 	return perf_event_attach_bpf_prog(event, prog, bpf_cookie);
 }
 
+int perf_event_set_bpf_prog(struct perf_event *event,
+			    struct bpf_prog *prog,
+			    u64 bpf_cookie)
+{
+	struct perf_event_context *ctx;
+	int ret;
+
+	ctx = perf_event_ctx_lock(event);
+	ret = __perf_event_set_bpf_prog(event, prog, bpf_cookie);
+	perf_event_ctx_unlock(event, ctx);
+
+	return ret;
+}
+
 void perf_event_free_bpf_prog(struct perf_event *event)
 {
 	if (!perf_event_is_tracing(event)) {
@@ -10848,7 +10866,15 @@ static void perf_event_free_filter(struc
 {
 }
 
-int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+				     struct bpf_prog *prog,
+				     u64 bpf_cookie)
+{
+	return -ENOENT;
+}
+
+int perf_event_set_bpf_prog(struct perf_event *event,
+			    struct bpf_prog *prog,
 			    u64 bpf_cookie)
 {
 	return -ENOENT;
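
The locked-wrapper/unlocked-helper split above can be sketched in user
space with pthreads. This is a hedged sketch only: ctx_mutex, set_prog()
and __set_prog() are made-up stand-ins for perf_event_ctx_lock() and the
two perf_event_set_bpf_prog() variants, not kernel code:

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for ctx->mutex taken via perf_event_ctx_lock(). */
static pthread_mutex_t ctx_mutex = PTHREAD_MUTEX_INITIALIZER;
static int prog_refs;

/* Unlocked helper: callers must already hold ctx_mutex, as the ioctl
 * path does (cf. __perf_event_set_bpf_prog()). */
static int __set_prog(int cookie)
{
	(void)cookie;
	prog_refs++;
	return 0;
}

/* Locked wrapper for external callers such as bpf_perf_link_attach()
 * (cf. the new perf_event_set_bpf_prog()). */
static int set_prog(int cookie)
{
	int ret;

	pthread_mutex_lock(&ctx_mutex);
	ret = __set_prog(cookie);
	pthread_mutex_unlock(&ctx_mutex);

	return ret;
}
```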



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 03/24] perf: Simplify child event tear-down
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 01/24] lockdep: Fix might_fault() Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 02/24] perf: Ensure bpf_perf_link path is properly serialized Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 04/24] perf: Simplify perf_event_free_task() wait Peter Zijlstra
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Currently perf_event_release_kernel() will iterate the child events and attempt
tear-down. However, it removes them from the child_list using list_move(),
notably skipping the state management done by perf_child_detach().

Crucially, it fails to clear PERF_ATTACH_CHILD, which opens the door for a
concurrent perf_remove_from_context() to race.

Fix this by passing DETACH_CHILD to perf_remove_from_context(), so that
perf_child_detach() does the state management; this way child_list
management stays fully serialized using child_mutex.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5526,8 +5526,8 @@ int perf_event_release_kernel(struct per
 		tmp = list_first_entry_or_null(&event->child_list,
 					       struct perf_event, child_list);
 		if (tmp == child) {
-			perf_remove_from_context(child, DETACH_GROUP);
-			list_move(&child->child_list, &free_list);
+			perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
+			list_add(&child->child_list, &free_list);
 			/*
 			 * This matches the refcount bump in inherit_event();
 			 * this can't be the last reference.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 04/24] perf: Simplify perf_event_free_task() wait
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (2 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 03/24] perf: Simplify child event tear-down Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 05/24] perf: Simplify perf_event_release_kernel() Peter Zijlstra
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   32 ++++++++------------------------
 1 file changed, 8 insertions(+), 24 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1223,8 +1223,14 @@ static void put_ctx(struct perf_event_co
 	if (refcount_dec_and_test(&ctx->refcount)) {
 		if (ctx->parent_ctx)
 			put_ctx(ctx->parent_ctx);
-		if (ctx->task && ctx->task != TASK_TOMBSTONE)
-			put_task_struct(ctx->task);
+		if (ctx->task) {
+			if (ctx->task == TASK_TOMBSTONE) {
+				smp_mb(); /* pairs with wait_var_event() */
+				wake_up_var(&ctx->refcount);
+			} else {
+				put_task_struct(ctx->task);
+			}
+		}
 		call_rcu(&ctx->rcu_head, free_ctx);
 	}
 }
@@ -5492,8 +5498,6 @@ int perf_event_release_kernel(struct per
 again:
 	mutex_lock(&event->child_mutex);
 	list_for_each_entry(child, &event->child_list, child_list) {
-		void *var = NULL;
-
 		/*
 		 * Cannot change, child events are not migrated, see the
 		 * comment with perf_event_ctx_lock_nested().
@@ -5533,39 +5537,19 @@ int perf_event_release_kernel(struct per
 			 * this can't be the last reference.
 			 */
 			put_event(event);
-		} else {
-			var = &ctx->refcount;
 		}
 
 		mutex_unlock(&event->child_mutex);
 		mutex_unlock(&ctx->mutex);
 		put_ctx(ctx);
 
-		if (var) {
-			/*
-			 * If perf_event_free_task() has deleted all events from the
-			 * ctx while the child_mutex got released above, make sure to
-			 * notify about the preceding put_ctx().
-			 */
-			smp_mb(); /* pairs with wait_var_event() */
-			wake_up_var(var);
-		}
 		goto again;
 	}
 	mutex_unlock(&event->child_mutex);
 
 	list_for_each_entry_safe(child, tmp, &free_list, child_list) {
-		void *var = &child->ctx->refcount;
-
 		list_del(&child->child_list);
 		free_event(child);
-
-		/*
-		 * Wake any perf_event_free_task() waiting for this event to be
-		 * freed.
-		 */
-		smp_mb(); /* pairs with wait_var_event() */
-		wake_up_var(var);
 	}
 
 no_ctx:
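
The centralized wakeup in put_ctx() follows the classic "signal when the
count hits zero" shape: signal in one place, where the last reference
drops, instead of at every call site. A user-space sketch with a
condition variable standing in for wake_up_var()/wait_var_event() (names
are illustrative, not the kernel API):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int refcount = 2;

/* Drop a reference; the *last* put does the wakeup, once, here
 * (cf. wake_up_var(&ctx->refcount) moved into put_ctx()). */
static void put_ref(void)
{
	pthread_mutex_lock(&lock);
	if (--refcount == 0)
		pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
}

/* Waiter side (cf. wait_var_event() in perf_event_free_task()). */
static void wait_for_zero(void)
{
	pthread_mutex_lock(&lock);
	while (refcount != 0)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
}
```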



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 05/24] perf: Simplify perf_event_release_kernel()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (3 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 04/24] perf: Simplify perf_event_free_task() wait Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 06/24] perf: Fix pmus_lock vs pmus_srcu ordering Peter Zijlstra
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Get rid of the free_list...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5462,7 +5462,6 @@ int perf_event_release_kernel(struct per
 {
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_event *child, *tmp;
-	LIST_HEAD(free_list);
 
 	/*
 	 * If we got here through err_alloc: free_event(event); we will not
@@ -5531,27 +5530,26 @@ int perf_event_release_kernel(struct per
 					       struct perf_event, child_list);
 		if (tmp == child) {
 			perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
-			list_add(&child->child_list, &free_list);
 			/*
 			 * This matches the refcount bump in inherit_event();
 			 * this can't be the last reference.
 			 */
 			put_event(event);
+		} else {
+			child = NULL;
 		}
 
 		mutex_unlock(&event->child_mutex);
 		mutex_unlock(&ctx->mutex);
+
+		if (child)
+			free_event(child);
 		put_ctx(ctx);
 
 		goto again;
 	}
 	mutex_unlock(&event->child_mutex);
 
-	list_for_each_entry_safe(child, tmp, &free_list, child_list) {
-		list_del(&child->child_list);
-		free_event(child);
-	}
-
 no_ctx:
 	put_event(event); /* Must be the 'last' reference */
 	return 0;



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 06/24] perf: Fix pmus_lock vs pmus_srcu ordering
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (4 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 05/24] perf: Simplify perf_event_release_kernel() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-27 16:59   ` Lucas De Marchi
  2025-02-05 10:21 ` [PATCH v2 07/24] perf: Fix perf_pmu_register() vs perf_init_event() Peter Zijlstra
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Commit a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order")
placed pmus_lock inside pmus_srcu, this makes perf_pmu_unregister()
trip lockdep.

Move the locking about such that only pmu_idr and pmus (list) are
modified while holding pmus_lock. This avoids doing synchronize_srcu()
while holding pmus_lock and all is well again.

Fixes: a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11836,6 +11836,8 @@ void perf_pmu_unregister(struct pmu *pmu
 {
 	mutex_lock(&pmus_lock);
 	list_del_rcu(&pmu->entry);
+	idr_remove(&pmu_idr, pmu->type);
+	mutex_unlock(&pmus_lock);
 
 	/*
 	 * We dereference the pmu list under both SRCU and regular RCU, so
@@ -11845,7 +11847,6 @@ void perf_pmu_unregister(struct pmu *pmu
 	synchronize_rcu();
 
 	free_percpu(pmu->pmu_disable_count);
-	idr_remove(&pmu_idr, pmu->type);
 	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
 		if (pmu->nr_addr_filters)
 			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
@@ -11853,7 +11854,6 @@ void perf_pmu_unregister(struct pmu *pmu
 		put_device(pmu->dev);
 	}
 	free_pmu_context(pmu);
-	mutex_unlock(&pmus_lock);
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
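
The shape of the fix: unpublish under the lock, drop the lock, then do
the blocking synchronization and teardown outside it. A hedged
user-space sketch (the registry array and unregister() are made-up
stand-ins for pmu_idr/the pmus list and perf_pmu_unregister(), not the
kernel code):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t pmus_lock = PTHREAD_MUTEX_INITIALIZER;
static void *registry[4];

static void unregister(int id, void (*teardown)(int))
{
	/* Short critical section: only unpublish under the lock
	 * (cf. list_del_rcu() + idr_remove()). */
	pthread_mutex_lock(&pmus_lock);
	registry[id] = NULL;
	pthread_mutex_unlock(&pmus_lock);

	/* Potentially blocking work (cf. synchronize_srcu()) runs with
	 * no locks held, so no ordering against pmus_lock can be
	 * violated. */
	if (teardown)
		teardown(id);
}
```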
 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 07/24] perf: Fix perf_pmu_register() vs perf_init_event()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (5 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 06/24] perf: Fix pmus_lock vs pmus_srcu ordering Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 08/24] perf: Cleanup perf_try_init_event() Peter Zijlstra
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

There is a fairly obvious race between perf_init_event() doing
idr_find() and perf_pmu_register() doing idr_alloc() with an
incompletely initialized pmu pointer.

Avoid by doing idr_alloc() on a NULL pointer to register the id, and
swizzling the real pmu pointer at the end using idr_replace().

Also make sure to not set pmu members after publishing the pmu, duh.

[ introduce idr_cmpxchg() in order to better handle the idr_replace()
  error case -- if it were to return an unexpected pointer, it will
  already have replaced the value and there is no going back. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11739,6 +11739,21 @@ static int pmu_dev_alloc(struct pmu *pmu
 static struct lock_class_key cpuctx_mutex;
 static struct lock_class_key cpuctx_lock;
 
+static bool idr_cmpxchg(struct idr *idr, unsigned long id, void *old, void *new)
+{
+	void *tmp, *val = idr_find(idr, id);
+
+	if (val != old)
+		return false;
+
+	tmp = idr_replace(idr, new, id);
+	if (IS_ERR(tmp))
+		return false;
+
+	WARN_ON_ONCE(tmp != val);
+	return true;
+}
+
 int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 {
 	int cpu, ret, max = PERF_TYPE_MAX;
@@ -11765,7 +11780,7 @@ int perf_pmu_register(struct pmu *pmu, c
 	if (type >= 0)
 		max = type;
 
-	ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
+	ret = idr_alloc(&pmu_idr, NULL, max, 0, GFP_KERNEL);
 	if (ret < 0)
 		goto free_pdc;
 
@@ -11773,6 +11788,7 @@ int perf_pmu_register(struct pmu *pmu, c
 
 	type = ret;
 	pmu->type = type;
+	atomic_set(&pmu->exclusive_cnt, 0);
 
 	if (pmu_bus_running && !pmu->dev) {
 		ret = pmu_dev_alloc(pmu);
@@ -11821,14 +11837,22 @@ int perf_pmu_register(struct pmu *pmu, c
 	if (!pmu->event_idx)
 		pmu->event_idx = perf_event_idx_default;
 
+	/*
+	 * Now that the PMU is complete, make it visible to perf_try_init_event().
+	 */
+	if (!idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu))
+		goto free_context;
 	list_add_rcu(&pmu->entry, &pmus);
-	atomic_set(&pmu->exclusive_cnt, 0);
+
 	ret = 0;
 unlock:
 	mutex_unlock(&pmus_lock);
 
 	return ret;
 
+free_context:
+	free_percpu(pmu->cpu_pmu_context);
+
 free_dev:
 	if (pmu->dev && pmu->dev != PMU_NULL_DEV) {
 		device_del(pmu->dev);
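
The reserve-then-publish idiom above can be sketched with C11 atomics:
reserve the slot as NULL, fully initialize the object, then publish it
last with a compare-and-exchange. This is a toy single-slot "idr";
slot_cmpxchg() is a stand-in for the idr_cmpxchg() helper, not the real
IDR API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One published slot; NULL means "id reserved, object not yet visible"
 * (cf. idr_alloc(&pmu_idr, NULL, ...) followed by idr_cmpxchg()). */
static _Atomic(void *) slot;

static bool slot_cmpxchg(void *old, void *new)
{
	void *expected = old;

	/* Only publish if the slot still holds the expected value, so a
	 * racing lookup can never observe a half-initialized object. */
	return atomic_compare_exchange_strong(&slot, &expected, new);
}
```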



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 08/24] perf: Cleanup perf_try_init_event()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (6 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 07/24] perf: Fix perf_pmu_register() vs perf_init_event() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-03-05 11:29   ` [tip: perf/core] perf/core: Clean up perf_try_init_event() tip-bot2 for Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 09/24] perf: Simplify perf_event_alloc() error path Peter Zijlstra
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Make sure that perf_try_init_event() doesn't leave event->pmu nor
event->destroy set on failure.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   69 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 41 insertions(+), 28 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12019,38 +12019,51 @@ static int perf_try_init_event(struct pm
 	if (ctx)
 		perf_event_ctx_unlock(event->group_leader, ctx);
 
-	if (!ret) {
-		if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
-		    has_extended_regs(event))
-			ret = -EOPNOTSUPP;
-
-		if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE &&
-		    event_has_any_exclude_flag(event))
-			ret = -EINVAL;
-
-		if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) {
-			const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu);
-			struct cpumask *pmu_cpumask = perf_scope_cpumask(pmu->scope);
-			int cpu;
-
-			if (pmu_cpumask && cpumask) {
-				cpu = cpumask_any_and(pmu_cpumask, cpumask);
-				if (cpu >= nr_cpu_ids)
-					ret = -ENODEV;
-				else
-					event->event_caps |= PERF_EV_CAP_READ_SCOPE;
-			} else {
-				ret = -ENODEV;
-			}
-		}
+	if (ret)
+		goto err_pmu;
 
-		if (ret && event->destroy)
-			event->destroy(event);
+	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
+	    has_extended_regs(event)) {
+		ret = -EOPNOTSUPP;
+		goto err_destroy;
 	}
 
-	if (ret)
-		module_put(pmu->module);
+	if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE &&
+	    event_has_any_exclude_flag(event)) {
+		ret = -EINVAL;
+		goto err_destroy;
+	}
+
+	if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) {
+		const struct cpumask *cpumask;
+		struct cpumask *pmu_cpumask;
+		int cpu;
+
+		cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu);
+		pmu_cpumask = perf_scope_cpumask(pmu->scope);
+
+		ret = -ENODEV;
+		if (!pmu_cpumask || !cpumask)
+			goto err_destroy;
+
+		cpu = cpumask_any_and(pmu_cpumask, cpumask);
+		if (cpu >= nr_cpu_ids)
+			goto err_destroy;
+
+		event->event_caps |= PERF_EV_CAP_READ_SCOPE;
+	}
+
+	return 0;
+
+err_destroy:
+	if (event->destroy) {
+		event->destroy(event);
+		event->destroy = NULL;
+	}
 
+err_pmu:
+	event->pmu = NULL;
+	module_put(pmu->module);
 	return ret;
 }
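
The rewritten function uses the standard kernel goto-ladder: each label
undoes exactly the steps that succeeded, and a failure leaves no stale
state behind (event->pmu and event->destroy are cleared). A
self-contained sketch of the same shape, where struct widget,
widget_init() and the fail_step knob are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

struct widget { int *a, *b; };

/* fail_step lets the test force a failure at a given construction step. */
static int widget_init(struct widget *w, int fail_step)
{
	int ret = -1;

	w->a = malloc(sizeof(*w->a));
	if (!w->a)
		return -1;
	if (fail_step == 1)
		goto err_a;

	w->b = malloc(sizeof(*w->b));
	if (!w->b)
		goto err_a;
	if (fail_step == 2)
		goto err_b;

	return 0;

err_b:
	free(w->b);
	w->b = NULL;
err_a:
	free(w->a);
	w->a = NULL;	/* like clearing event->pmu: no stale state on failure */
	return ret;
}
```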
 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 09/24] perf: Simplify perf_event_alloc() error path
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (7 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 08/24] perf: Cleanup perf_try_init_event() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 10/24] perf: Simplify perf_pmu_register() " Peter Zijlstra
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

The error cleanup sequence in perf_event_alloc() is a subset of the
existing _free_event() function (it must of course be).

Split this out into __free_event() and simplify the error path.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |   16 +++--
 kernel/events/core.c       |  134 ++++++++++++++++++++++-----------------------
 2 files changed, 75 insertions(+), 75 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -673,13 +673,15 @@ struct swevent_hlist {
 	struct rcu_head			rcu_head;
 };
 
-#define PERF_ATTACH_CONTEXT	0x01
-#define PERF_ATTACH_GROUP	0x02
-#define PERF_ATTACH_TASK	0x04
-#define PERF_ATTACH_TASK_DATA	0x08
-#define PERF_ATTACH_ITRACE	0x10
-#define PERF_ATTACH_SCHED_CB	0x20
-#define PERF_ATTACH_CHILD	0x40
+#define PERF_ATTACH_CONTEXT	0x0001
+#define PERF_ATTACH_GROUP	0x0002
+#define PERF_ATTACH_TASK	0x0004
+#define PERF_ATTACH_TASK_DATA	0x0008
+#define PERF_ATTACH_ITRACE	0x0010
+#define PERF_ATTACH_SCHED_CB	0x0020
+#define PERF_ATTACH_CHILD	0x0040
+#define PERF_ATTACH_EXCLUSIVE	0x0080
+#define PERF_ATTACH_CALLCHAIN	0x0100
 
 struct bpf_prog;
 struct perf_cgroup;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5245,6 +5245,8 @@ static int exclusive_event_init(struct p
 			return -EBUSY;
 	}
 
+	event->attach_state |= PERF_ATTACH_EXCLUSIVE;
+
 	return 0;
 }
 
@@ -5252,14 +5254,13 @@ static void exclusive_event_destroy(stru
 {
 	struct pmu *pmu = event->pmu;
 
-	if (!is_exclusive_pmu(pmu))
-		return;
-
 	/* see comment in exclusive_event_init() */
 	if (event->attach_state & PERF_ATTACH_TASK)
 		atomic_dec(&pmu->exclusive_cnt);
 	else
 		atomic_inc(&pmu->exclusive_cnt);
+
+	event->attach_state &= ~PERF_ATTACH_EXCLUSIVE;
 }
 
 static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
@@ -5318,40 +5319,20 @@ static void perf_pending_task_sync(struc
 	rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
 }
 
-static void _free_event(struct perf_event *event)
+/* vs perf_event_alloc() error */
+static void __free_event(struct perf_event *event)
 {
-	irq_work_sync(&event->pending_irq);
-	irq_work_sync(&event->pending_disable_irq);
-	perf_pending_task_sync(event);
-
-	unaccount_event(event);
+	if (event->attach_state & PERF_ATTACH_CALLCHAIN)
+		put_callchain_buffers();
 
-	security_perf_event_free(event);
+	kfree(event->addr_filter_ranges);
 
-	if (event->rb) {
-		/*
-		 * Can happen when we close an event with re-directed output.
-		 *
-		 * Since we have a 0 refcount, perf_mmap_close() will skip
-		 * over us; possibly making our ring_buffer_put() the last.
-		 */
-		mutex_lock(&event->mmap_mutex);
-		ring_buffer_attach(event, NULL);
-		mutex_unlock(&event->mmap_mutex);
-	}
+	if (event->attach_state & PERF_ATTACH_EXCLUSIVE)
+		exclusive_event_destroy(event);
 
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
 
-	if (!event->parent) {
-		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
-			put_callchain_buffers();
-	}
-
-	perf_event_free_bpf_prog(event);
-	perf_addr_filters_splice(event, NULL);
-	kfree(event->addr_filter_ranges);
-
 	if (event->destroy)
 		event->destroy(event);
 
@@ -5362,22 +5343,58 @@ static void _free_event(struct perf_even
 	if (event->hw.target)
 		put_task_struct(event->hw.target);
 
-	if (event->pmu_ctx)
+	if (event->pmu_ctx) {
+		/*
+		 * put_pmu_ctx() needs an event->ctx reference, because of
+		 * epc->ctx.
+		 */
+		WARN_ON_ONCE(!event->ctx);
+		WARN_ON_ONCE(event->pmu_ctx->ctx != event->ctx);
 		put_pmu_ctx(event->pmu_ctx);
+	}
 
 	/*
-	 * perf_event_free_task() relies on put_ctx() being 'last', in particular
-	 * all task references must be cleaned up.
+	 * perf_event_free_task() relies on put_ctx() being 'last', in
+	 * particular all task references must be cleaned up.
 	 */
 	if (event->ctx)
 		put_ctx(event->ctx);
 
-	exclusive_event_destroy(event);
-	module_put(event->pmu->module);
+	if (event->pmu)
+		module_put(event->pmu->module);
 
 	call_rcu(&event->rcu_head, free_event_rcu);
 }
 
+/* vs perf_event_alloc() success */
+static void _free_event(struct perf_event *event)
+{
+	irq_work_sync(&event->pending_irq);
+	irq_work_sync(&event->pending_disable_irq);
+	perf_pending_task_sync(event);
+
+	unaccount_event(event);
+
+	security_perf_event_free(event);
+
+	if (event->rb) {
+		/*
+		 * Can happen when we close an event with re-directed output.
+		 *
+		 * Since we have a 0 refcount, perf_mmap_close() will skip
+		 * over us; possibly making our ring_buffer_put() the last.
+		 */
+		mutex_lock(&event->mmap_mutex);
+		ring_buffer_attach(event, NULL);
+		mutex_unlock(&event->mmap_mutex);
+	}
+
+	perf_event_free_bpf_prog(event);
+	perf_addr_filters_splice(event, NULL);
+
+	__free_event(event);
+}
+
 /*
  * Used to free events which have a known refcount of 1, such as in error paths
  * where the event isn't exposed yet and inherited events.
@@ -12390,7 +12407,7 @@ perf_event_alloc(struct perf_event_attr
 	 * See perf_output_read().
 	 */
 	if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID))
-		goto err_ns;
+		goto err;
 
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
@@ -12398,7 +12415,7 @@ perf_event_alloc(struct perf_event_attr
 	pmu = perf_init_event(event);
 	if (IS_ERR(pmu)) {
 		err = PTR_ERR(pmu);
-		goto err_ns;
+		goto err;
 	}
 
 	/*
@@ -12408,25 +12425,25 @@ perf_event_alloc(struct perf_event_attr
 	 */
 	if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) {
 		err = -EINVAL;
-		goto err_pmu;
+		goto err;
 	}
 
 	if (event->attr.aux_output &&
 	    (!(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT) ||
 	     event->attr.aux_pause || event->attr.aux_resume)) {
 		err = -EOPNOTSUPP;
-		goto err_pmu;
+		goto err;
 	}
 
 	if (event->attr.aux_pause && event->attr.aux_resume) {
 		err = -EINVAL;
-		goto err_pmu;
+		goto err;
 	}
 
 	if (event->attr.aux_start_paused) {
 		if (!(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE)) {
 			err = -EOPNOTSUPP;
-			goto err_pmu;
+			goto err;
 		}
 		event->hw.aux_paused = 1;
 	}
@@ -12434,12 +12451,12 @@ perf_event_alloc(struct perf_event_attr
 	if (cgroup_fd != -1) {
 		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
 		if (err)
-			goto err_pmu;
+			goto err;
 	}
 
 	err = exclusive_event_init(event);
 	if (err)
-		goto err_pmu;
+		goto err;
 
 	if (has_addr_filter(event)) {
 		event->addr_filter_ranges = kcalloc(pmu->nr_addr_filters,
@@ -12447,7 +12464,7 @@ perf_event_alloc(struct perf_event_attr
 						    GFP_KERNEL);
 		if (!event->addr_filter_ranges) {
 			err = -ENOMEM;
-			goto err_per_task;
+			goto err;
 		}
 
 		/*
@@ -12472,41 +12489,22 @@ perf_event_alloc(struct perf_event_attr
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
 			err = get_callchain_buffers(attr->sample_max_stack);
 			if (err)
-				goto err_addr_filters;
+				goto err;
+			event->attach_state |= PERF_ATTACH_CALLCHAIN;
 		}
 	}
 
 	err = security_perf_event_alloc(event);
 	if (err)
-		goto err_callchain_buffer;
+		goto err;
 
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
 	return event;
 
-err_callchain_buffer:
-	if (!event->parent) {
-		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
-			put_callchain_buffers();
-	}
-err_addr_filters:
-	kfree(event->addr_filter_ranges);
-
-err_per_task:
-	exclusive_event_destroy(event);
-
-err_pmu:
-	if (is_cgroup_event(event))
-		perf_detach_cgroup(event);
-	if (event->destroy)
-		event->destroy(event);
-	module_put(pmu->module);
-err_ns:
-	if (event->hw.target)
-		put_task_struct(event->hw.target);
-	call_rcu(&event->rcu_head, free_event_rcu);
-
+err:
+	__free_event(event);
 	return ERR_PTR(err);
 }
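
Tracking construction progress in attach_state bits is what lets one
teardown function serve both the error path and the regular free path. A
minimal sketch of that pattern; the flag names and counters here are
illustrative, not the perf implementation:

```c
#include <assert.h>

#define ATTACH_EXCLUSIVE	0x01
#define ATTACH_CALLCHAIN	0x02

/* Stand-ins for pmu->exclusive_cnt and the callchain buffer refcount. */
static int exclusive_cnt, callchain_refs;

struct event { unsigned int attach_state; };

/* Construction records a state bit per step that succeeded... */
static int event_init(struct event *e, int fail_after)
{
	e->attach_state = 0;

	exclusive_cnt++;
	e->attach_state |= ATTACH_EXCLUSIVE;
	if (fail_after == 1)
		return -1;

	callchain_refs++;
	e->attach_state |= ATTACH_CALLCHAIN;
	return 0;
}

/* ...and a single teardown undoes exactly the recorded steps, so a
 * partially constructed event and a complete one share the same free
 * path (cf. __free_event() vs _free_event()). */
static void event_free(struct event *e)
{
	if (e->attach_state & ATTACH_CALLCHAIN)
		callchain_refs--;
	if (e->attach_state & ATTACH_EXCLUSIVE)
		exclusive_cnt--;
	e->attach_state = 0;
}
```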
 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 10/24] perf: Simplify perf_pmu_register() error path
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (8 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 09/24] perf: Simplify perf_event_alloc() error path Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 11/24] perf: Simplify perf_pmu_register() Peter Zijlstra
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

The error path of perf_pmu_register() is of course very similar to a
subset of perf_pmu_unregister(). Extract this common part into
perf_pmu_free() and simplify things.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   67 ++++++++++++++++++++++-----------------------------
 1 file changed, 30 insertions(+), 37 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11605,11 +11605,6 @@ static int perf_event_idx_default(struct
 	return 0;
 }
 
-static void free_pmu_context(struct pmu *pmu)
-{
-	free_percpu(pmu->cpu_pmu_context);
-}
-
 /*
  * Let userspace know that this PMU supports address range filtering:
  */
@@ -11815,6 +11810,7 @@ static int pmu_dev_alloc(struct pmu *pmu
 
 free_dev:
 	put_device(pmu->dev);
+	pmu->dev = NULL;
 	goto out;
 }
 
@@ -11836,25 +11832,38 @@ static bool idr_cmpxchg(struct idr *idr,
 	return true;
 }
 
+static void perf_pmu_free(struct pmu *pmu)
+{
+	free_percpu(pmu->pmu_disable_count);
+	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
+		if (pmu->nr_addr_filters)
+			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
+		device_del(pmu->dev);
+		put_device(pmu->dev);
+	}
+	free_percpu(pmu->cpu_pmu_context);
+}
+
 int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 {
 	int cpu, ret, max = PERF_TYPE_MAX;
 
+	pmu->type = -1;
+
 	mutex_lock(&pmus_lock);
 	ret = -ENOMEM;
 	pmu->pmu_disable_count = alloc_percpu(int);
 	if (!pmu->pmu_disable_count)
 		goto unlock;
 
-	pmu->type = -1;
 	if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
 		ret = -EINVAL;
-		goto free_pdc;
+		goto free;
 	}
 
 	if (WARN_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE, "Can not register a pmu with an invalid scope.\n")) {
 		ret = -EINVAL;
-		goto free_pdc;
+		goto free;
 	}
 
 	pmu->name = name;
@@ -11864,24 +11873,23 @@ int perf_pmu_register(struct pmu *pmu, c
 
 	ret = idr_alloc(&pmu_idr, NULL, max, 0, GFP_KERNEL);
 	if (ret < 0)
-		goto free_pdc;
+		goto free;
 
 	WARN_ON(type >= 0 && ret != type);
 
-	type = ret;
-	pmu->type = type;
+	pmu->type = ret;
 	atomic_set(&pmu->exclusive_cnt, 0);
 
 	if (pmu_bus_running && !pmu->dev) {
 		ret = pmu_dev_alloc(pmu);
 		if (ret)
-			goto free_idr;
+			goto free;
 	}
 
 	ret = -ENOMEM;
 	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
 	if (!pmu->cpu_pmu_context)
-		goto free_dev;
+		goto free;
 
 	for_each_possible_cpu(cpu) {
 		struct perf_cpu_pmu_context *cpc;
@@ -11922,8 +11930,10 @@ int perf_pmu_register(struct pmu *pmu, c
 	/*
 	 * Now that the PMU is complete, make it visible to perf_try_init_event().
 	 */
-	if (!idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu))
-		goto free_context;
+	if (!idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu)) {
+		ret = -EINVAL;
+		goto free;
+	}
 	list_add_rcu(&pmu->entry, &pmus);
 
 	ret = 0;
@@ -11932,20 +11942,10 @@ int perf_pmu_register(struct pmu *pmu, c
 
 	return ret;
 
-free_context:
-	free_percpu(pmu->cpu_pmu_context);
-
-free_dev:
-	if (pmu->dev && pmu->dev != PMU_NULL_DEV) {
-		device_del(pmu->dev);
-		put_device(pmu->dev);
-	}
-
-free_idr:
-	idr_remove(&pmu_idr, pmu->type);
-
-free_pdc:
-	free_percpu(pmu->pmu_disable_count);
+free:
+	if (pmu->type >= 0)
+		idr_remove(&pmu_idr, pmu->type);
+	perf_pmu_free(pmu);
 	goto unlock;
 }
 EXPORT_SYMBOL_GPL(perf_pmu_register);
@@ -11964,14 +11964,7 @@ void perf_pmu_unregister(struct pmu *pmu
 	synchronize_srcu(&pmus_srcu);
 	synchronize_rcu();
 
-	free_percpu(pmu->pmu_disable_count);
-	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
-		if (pmu->nr_addr_filters)
-			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
-		device_del(pmu->dev);
-		put_device(pmu->dev);
-	}
-	free_pmu_context(pmu);
+	perf_pmu_free(pmu);
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 11/24] perf: Simplify perf_pmu_register()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (9 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 10/24] perf: Simplify perf_pmu_register() " Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 12/24] perf: Simplify perf_init_event() Peter Zijlstra
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Using the previously introduced perf_pmu_free() and a new IDR helper,
simplify the perf_pmu_register() error paths.
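
The scope-based cleanup idiom the patch relies on (the kernel's
__free()/no_free_ptr() machinery from <linux/cleanup.h>) can be
sketched with the underlying GCC/Clang cleanup attribute in standalone
C (a simplified illustration, not the kernel implementation; the helper
names below are invented):

```c
#include <assert.h>
#include <stdlib.h>

/* Cleanup handler: runs automatically when the variable leaves scope. */
static void free_intp(int **p)
{
	free(*p);
}

/* Steal the pointer so the cleanup handler sees NULL and does nothing
 * (this is the role no_free_ptr() plays in the kernel). */
static int *no_free_intp(int **p)
{
	int *ret = *p;
	*p = NULL;
	return ret;
}

static int *make_value(int v, int fail)
{
	int *val __attribute__((cleanup(free_intp))) = malloc(sizeof(int));

	if (!val)
		return NULL;
	*val = v;
	if (fail)
		return NULL;	/* early return: val is freed automatically */

	return no_free_intp(&val);	/* success: disarm the cleanup */
}
```

Every early return frees the resource without a goto, and only the
success path explicitly disarms the cleanup -- which is why the error
labels in perf_pmu_register() can disappear entirely.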

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/idr.h  |   17 ++++++++++++
 kernel/events/core.c |   71 ++++++++++++++++++++-------------------------------
 2 files changed, 46 insertions(+), 42 deletions(-)

--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -15,6 +15,7 @@
 #include <linux/radix-tree.h>
 #include <linux/gfp.h>
 #include <linux/percpu.h>
+#include <linux/cleanup.h>
 
 struct idr {
 	struct radix_tree_root	idr_rt;
@@ -124,6 +125,22 @@ void *idr_get_next_ul(struct idr *, unsi
 void *idr_replace(struct idr *, void *, unsigned long id);
 void idr_destroy(struct idr *);
 
+struct __class_idr {
+	struct idr *idr;
+	int id;
+};
+
+#define idr_null ((struct __class_idr){ NULL, -1 })
+#define take_idr_id(id) __get_and_null(id, idr_null)
+
+DEFINE_CLASS(idr_alloc, struct __class_idr,
+	     if (_T.id >= 0) idr_remove(_T.idr, _T.id),
+	     ((struct __class_idr){
+	     	.idr = idr,
+		.id = idr_alloc(idr, ptr, start, end, gfp),
+	     }),
+	     struct idr *idr, void *ptr, int start, int end, gfp_t gfp);
+
 /**
  * idr_init_base() - Initialise an IDR.
  * @idr: IDR handle.
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11778,52 +11778,49 @@ static void perf_pmu_free(struct pmu *pm
 	free_percpu(pmu->cpu_pmu_context);
 }
 
-int perf_pmu_register(struct pmu *pmu, const char *name, int type)
+DEFINE_FREE(pmu_unregister, struct pmu *, if (_T) perf_pmu_free(_T))
+
+int perf_pmu_register(struct pmu *_pmu, const char *name, int type)
 {
-	int cpu, ret, max = PERF_TYPE_MAX;
+	int cpu, max = PERF_TYPE_MAX;
 
-	pmu->type = -1;
+	struct pmu *pmu __free(pmu_unregister) = _pmu;
+	guard(mutex)(&pmus_lock);
 
-	mutex_lock(&pmus_lock);
-	ret = -ENOMEM;
 	pmu->pmu_disable_count = alloc_percpu(int);
 	if (!pmu->pmu_disable_count)
-		goto unlock;
+		return -ENOMEM;
 
-	if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
-		ret = -EINVAL;
-		goto free;
-	}
+	if (WARN_ONCE(!name, "Can not register anonymous pmu.\n"))
+		return -EINVAL;
 
-	if (WARN_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE, "Can not register a pmu with an invalid scope.\n")) {
-		ret = -EINVAL;
-		goto free;
-	}
+	if (WARN_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE,
+		      "Can not register a pmu with an invalid scope.\n"))
+		return -EINVAL;
 
 	pmu->name = name;
 
 	if (type >= 0)
 		max = type;
 
-	ret = idr_alloc(&pmu_idr, NULL, max, 0, GFP_KERNEL);
-	if (ret < 0)
-		goto free;
+	CLASS(idr_alloc, pmu_type)(&pmu_idr, NULL, max, 0, GFP_KERNEL);
+	if (pmu_type.id < 0)
+		return pmu_type.id;
 
-	WARN_ON(type >= 0 && ret != type);
+	WARN_ON(type >= 0 && pmu_type.id != type);
 
-	pmu->type = ret;
+	pmu->type = pmu_type.id;
 	atomic_set(&pmu->exclusive_cnt, 0);
 
 	if (pmu_bus_running && !pmu->dev) {
-		ret = pmu_dev_alloc(pmu);
+		int ret = pmu_dev_alloc(pmu);
 		if (ret)
-			goto free;
+			return ret;
 	}
 
-	ret = -ENOMEM;
 	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
 	if (!pmu->cpu_pmu_context)
-		goto free;
+		return -ENOMEM;
 
 	for_each_possible_cpu(cpu) {
 		struct perf_cpu_pmu_context *cpc;
@@ -11864,32 +11861,22 @@ int perf_pmu_register(struct pmu *pmu, c
 	/*
 	 * Now that the PMU is complete, make it visible to perf_try_init_event().
 	 */
-	if (!idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu)) {
-		ret = -EINVAL;
-		goto free;
-	}
+	if (!idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu))
+		return -EINVAL;
 	list_add_rcu(&pmu->entry, &pmus);
 
-	ret = 0;
-unlock:
-	mutex_unlock(&pmus_lock);
-
-	return ret;
-
-free:
-	if (pmu->type >= 0)
-		idr_remove(&pmu_idr, pmu->type);
-	perf_pmu_free(pmu);
-	goto unlock;
+	take_idr_id(pmu_type);
+	_pmu = no_free_ptr(pmu); // let it rip
+	return 0;
 }
 EXPORT_SYMBOL_GPL(perf_pmu_register);
 
 void perf_pmu_unregister(struct pmu *pmu)
 {
-	mutex_lock(&pmus_lock);
-	list_del_rcu(&pmu->entry);
-	idr_remove(&pmu_idr, pmu->type);
-	mutex_unlock(&pmus_lock);
+	scoped_guard (mutex, &pmus_lock) {
+		list_del_rcu(&pmu->entry);
+		idr_remove(&pmu_idr, pmu->type);
+	}
 
 	/*
 	 * We dereference the pmu list under both SRCU and regular RCU, so



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 12/24] perf: Simplify perf_init_event()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (10 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 11/24] perf: Simplify perf_pmu_register() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 13/24] perf: Simplify perf_event_alloc() Peter Zijlstra
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11942,10 +11942,10 @@ static int perf_try_init_event(struct pm
 static struct pmu *perf_init_event(struct perf_event *event)
 {
 	bool extended_type = false;
-	int idx, type, ret;
 	struct pmu *pmu;
+	int type, ret;
 
-	idx = srcu_read_lock(&pmus_srcu);
+	guard(srcu)(&pmus_srcu);
 
 	/*
 	 * Save original type before calling pmu->event_init() since certain
@@ -11958,7 +11958,7 @@ static struct pmu *perf_init_event(struc
 		pmu = event->parent->pmu;
 		ret = perf_try_init_event(pmu, event);
 		if (!ret)
-			goto unlock;
+			return pmu;
 	}
 
 	/*
@@ -11977,13 +11977,12 @@ static struct pmu *perf_init_event(struc
 	}
 
 again:
-	rcu_read_lock();
-	pmu = idr_find(&pmu_idr, type);
-	rcu_read_unlock();
+	scoped_guard (rcu)
+		pmu = idr_find(&pmu_idr, type);
 	if (pmu) {
 		if (event->attr.type != type && type != PERF_TYPE_RAW &&
 		    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
-			goto fail;
+			return ERR_PTR(-ENOENT);
 
 		ret = perf_try_init_event(pmu, event);
 		if (ret == -ENOENT && event->attr.type != type && !extended_type) {
@@ -11992,27 +11991,21 @@ static struct pmu *perf_init_event(struc
 		}
 
 		if (ret)
-			pmu = ERR_PTR(ret);
+			return ERR_PTR(ret);
 
-		goto unlock;
+		return pmu;
 	}
 
 	list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
 		ret = perf_try_init_event(pmu, event);
 		if (!ret)
-			goto unlock;
+			return pmu;
 
-		if (ret != -ENOENT) {
-			pmu = ERR_PTR(ret);
-			goto unlock;
-		}
+		if (ret != -ENOENT)
+			return ERR_PTR(ret);
 	}
-fail:
-	pmu = ERR_PTR(-ENOENT);
-unlock:
-	srcu_read_unlock(&pmus_srcu, idx);
 
-	return pmu;
+	return ERR_PTR(-ENOENT);
 }
 
 static void attach_sb_event(struct perf_event *event)



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 13/24] perf: Simplify perf_event_alloc()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (11 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 12/24] perf: Simplify perf_init_event() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 14/24] perf: Merge pmu_disable_count into cpu_pmu_context Peter Zijlstra
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Using the previous simplifications, transition perf_event_alloc() to
the cleanup way of things -- reducing error path magic.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   60 +++++++++++++++++++--------------------------------
 1 file changed, 23 insertions(+), 37 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5360,6 +5361,8 @@ static void __free_event(struct perf_eve
 	call_rcu(&event->rcu_head, free_event_rcu);
 }
 
+DEFINE_FREE(__free_event, struct perf_event *, if (_T) __free_event(_T))
+
 /* vs perf_event_alloc() success */
 static void _free_event(struct perf_event *event)
 {
@@ -12238,7 +12241,6 @@ perf_event_alloc(struct perf_event_attr
 		 void *context, int cgroup_fd)
 {
 	struct pmu *pmu;
-	struct perf_event *event;
 	struct hw_perf_event *hwc;
 	long err = -EINVAL;
 	int node;
@@ -12253,8 +12255,8 @@ perf_event_alloc(struct perf_event_attr
 	}
 
 	node = (cpu >= 0) ? cpu_to_node(cpu) : -1;
-	event = kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO,
-				      node);
+	struct perf_event *event __free(__free_event) =
+		kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO, node);
 	if (!event)
 		return ERR_PTR(-ENOMEM);
 
@@ -12361,65 +12363,53 @@ perf_event_alloc(struct perf_event_attr
 	 * See perf_output_read().
 	 */
 	if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID))
-		goto err;
+		return ERR_PTR(-EINVAL);
 
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
 	pmu = perf_init_event(event);
-	if (IS_ERR(pmu)) {
-		err = PTR_ERR(pmu);
-		goto err;
-	}
+	if (IS_ERR(pmu))
+		return (void*)pmu;
 
 	/*
 	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
 	 * events (they don't make sense as the cgroup will be different
 	 * on other CPUs in the uncore mask).
 	 */
-	if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) {
-		err = -EINVAL;
-		goto err;
-	}
+	if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1))
+		return ERR_PTR(-EINVAL);
 
 	if (event->attr.aux_output &&
 	    (!(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT) ||
-	     event->attr.aux_pause || event->attr.aux_resume)) {
-		err = -EOPNOTSUPP;
-		goto err;
-	}
+	     event->attr.aux_pause || event->attr.aux_resume))
+		return ERR_PTR(-EOPNOTSUPP);
 
-	if (event->attr.aux_pause && event->attr.aux_resume) {
-		err = -EINVAL;
-		goto err;
-	}
+	if (event->attr.aux_pause && event->attr.aux_resume)
+		return ERR_PTR(-EINVAL);
 
 	if (event->attr.aux_start_paused) {
-		if (!(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE)) {
-			err = -EOPNOTSUPP;
-			goto err;
-		}
+		if (!(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE))
+			return ERR_PTR(-EOPNOTSUPP);
 		event->hw.aux_paused = 1;
 	}
 
 	if (cgroup_fd != -1) {
 		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
 		if (err)
-			goto err;
+			return ERR_PTR(err);
 	}
 
 	err = exclusive_event_init(event);
 	if (err)
-		goto err;
+		return ERR_PTR(err);
 
 	if (has_addr_filter(event)) {
 		event->addr_filter_ranges = kcalloc(pmu->nr_addr_filters,
 						    sizeof(struct perf_addr_filter_range),
 						    GFP_KERNEL);
-		if (!event->addr_filter_ranges) {
-			err = -ENOMEM;
-			goto err;
-		}
+		if (!event->addr_filter_ranges)
+			return ERR_PTR(-ENOMEM);
 
 		/*
 		 * Clone the parent's vma offsets: they are valid until exec()
@@ -12443,23 +12433,19 @@ perf_event_alloc(struct perf_event_attr
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
 			err = get_callchain_buffers(attr->sample_max_stack);
 			if (err)
-				goto err;
+				return ERR_PTR(err);
 			event->attach_state |= PERF_ATTACH_CALLCHAIN;
 		}
 	}
 
 	err = security_perf_event_alloc(event);
 	if (err)
-		goto err;
+		return ERR_PTR(err);
 
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
-	return event;
-
-err:
-	__free_event(event);
-	return ERR_PTR(err);
+	return_ptr(event);
 }
 
 static int perf_copy_attr(struct perf_event_attr __user *uattr,



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 14/24] perf: Merge pmu_disable_count into cpu_pmu_context
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (12 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 13/24] perf: Simplify perf_event_alloc() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 15/24] perf: Add this_cpc() helper Peter Zijlstra
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Because it makes no sense to have two per-cpu allocations per pmu.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |    2 +-
 kernel/events/core.c       |   12 ++++--------
 2 files changed, 5 insertions(+), 9 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -343,7 +343,6 @@ struct pmu {
 	 */
 	unsigned int			scope;
 
-	int __percpu			*pmu_disable_count;
 	struct perf_cpu_pmu_context __percpu *cpu_pmu_context;
 	atomic_t			exclusive_cnt; /* < 0: cpu; > 0: tsk */
 	int				task_ctx_nr;
@@ -1031,6 +1030,7 @@ struct perf_cpu_pmu_context {
 
 	int				active_oncpu;
 	int				exclusive;
+	int				pmu_disable_count;
 
 	raw_spinlock_t			hrtimer_lock;
 	struct hrtimer			hrtimer;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1174,21 +1174,22 @@ static int perf_mux_hrtimer_restart_ipi(
 
 void perf_pmu_disable(struct pmu *pmu)
 {
-	int *count = this_cpu_ptr(pmu->pmu_disable_count);
+	int *count = &this_cpu_ptr(pmu->cpu_pmu_context)->pmu_disable_count;
 	if (!(*count)++)
 		pmu->pmu_disable(pmu);
 }
 
 void perf_pmu_enable(struct pmu *pmu)
 {
-	int *count = this_cpu_ptr(pmu->pmu_disable_count);
+	int *count = &this_cpu_ptr(pmu->cpu_pmu_context)->pmu_disable_count;
 	if (!--(*count))
 		pmu->pmu_enable(pmu);
 }
 
 static void perf_assert_pmu_disabled(struct pmu *pmu)
 {
-	WARN_ON_ONCE(*this_cpu_ptr(pmu->pmu_disable_count) == 0);
+	int *count = &this_cpu_ptr(pmu->cpu_pmu_context)->pmu_disable_count;
+	WARN_ON_ONCE(*count == 0);
 }
 
 static inline void perf_pmu_read(struct perf_event *event)
@@ -11860,7 +11861,6 @@ static bool idr_cmpxchg(struct idr *idr,
 
 static void perf_pmu_free(struct pmu *pmu)
 {
-	free_percpu(pmu->pmu_disable_count);
 	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
 		if (pmu->nr_addr_filters)
 			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
@@ -11879,10 +11879,6 @@ int perf_pmu_register(struct pmu *_pmu,
 	struct pmu *pmu __free(pmu_unregister) = _pmu;
 	guard(mutex)(&pmus_lock);
 
-	pmu->pmu_disable_count = alloc_percpu(int);
-	if (!pmu->pmu_disable_count)
-		return -ENOMEM;
-
 	if (WARN_ONCE(!name, "Can not register anonymous pmu.\n"))
 		return -EINVAL;
 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 15/24] perf: Add this_cpc() helper
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (13 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 14/24] perf: Merge pmu_disable_count into cpu_pmu_context Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 16/24] perf: Detach perf_cpu_pmu_context and pmu lifetimes Peter Zijlstra
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

As a preparation for adding yet another indirection.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1176,23 +1176,28 @@ static int perf_mux_hrtimer_restart_ipi(
 	return perf_mux_hrtimer_restart(arg);
 }
 
+static __always_inline struct perf_cpu_pmu_context *this_cpc(struct pmu *pmu)
+{
+	return this_cpu_ptr(pmu->cpu_pmu_context);
+}
+
 void perf_pmu_disable(struct pmu *pmu)
 {
-	int *count = &this_cpu_ptr(pmu->cpu_pmu_context)->pmu_disable_count;
+	int *count = &this_cpc(pmu)->pmu_disable_count;
 	if (!(*count)++)
 		pmu->pmu_disable(pmu);
 }
 
 void perf_pmu_enable(struct pmu *pmu)
 {
-	int *count = &this_cpu_ptr(pmu->cpu_pmu_context)->pmu_disable_count;
+	int *count = &this_cpc(pmu)->pmu_disable_count;
 	if (!--(*count))
 		pmu->pmu_enable(pmu);
 }
 
 static void perf_assert_pmu_disabled(struct pmu *pmu)
 {
-	int *count = &this_cpu_ptr(pmu->cpu_pmu_context)->pmu_disable_count;
+	int *count = &this_cpc(pmu)->pmu_disable_count;
 	WARN_ON_ONCE(*count == 0);
 }
 
@@ -2304,7 +2309,7 @@ static void
 event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 {
 	struct perf_event_pmu_context *epc = event->pmu_ctx;
-	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpc(epc->pmu);
 	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
 
 	// XXX cpc serialization, probably per-cpu IRQ disabled
@@ -2445,9 +2450,8 @@ __perf_remove_from_context(struct perf_e
 		pmu_ctx->rotate_necessary = 0;
 
 		if (ctx->task && ctx->is_active) {
-			struct perf_cpu_pmu_context *cpc;
+			struct perf_cpu_pmu_context *cpc = this_cpc(pmu_ctx->pmu);
 
-			cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
 			WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
 			cpc->task_epc = NULL;
 		}
@@ -2585,7 +2589,7 @@ static int
 event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
 {
 	struct perf_event_pmu_context *epc = event->pmu_ctx;
-	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpc(epc->pmu);
 	int ret = 0;
 
 	WARN_ON_ONCE(event->ctx != ctx);
@@ -2692,7 +2696,7 @@ group_sched_in(struct perf_event *group_
 static int group_can_go_on(struct perf_event *event, int can_add_hw)
 {
 	struct perf_event_pmu_context *epc = event->pmu_ctx;
-	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpc(epc->pmu);
 
 	/*
 	 * Groups consisting entirely of software events can always go on.
@@ -3315,9 +3319,8 @@ static void __pmu_ctx_sched_out(struct p
 	struct pmu *pmu = pmu_ctx->pmu;
 
 	if (ctx->task && !(ctx->is_active & EVENT_ALL)) {
-		struct perf_cpu_pmu_context *cpc;
+		struct perf_cpu_pmu_context *cpc = this_cpc(pmu);
 
-		cpc = this_cpu_ptr(pmu->cpu_pmu_context);
 		WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
 		cpc->task_epc = NULL;
 	}
@@ -3565,7 +3568,7 @@ static void perf_ctx_sched_task_cb(struc
 	struct perf_cpu_pmu_context *cpc;
 
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+		cpc = this_cpc(pmu_ctx->pmu);
 
 		if (cpc->sched_cb_usage && pmu_ctx->pmu->sched_task)
 			pmu_ctx->pmu->sched_task(pmu_ctx, sched_in);
@@ -3674,7 +3677,7 @@ static DEFINE_PER_CPU(int, perf_sched_cb
 
 void perf_sched_cb_dec(struct pmu *pmu)
 {
-	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpc(pmu);
 
 	this_cpu_dec(perf_sched_cb_usages);
 	barrier();
@@ -3686,7 +3689,7 @@ void perf_sched_cb_dec(struct pmu *pmu)
 
 void perf_sched_cb_inc(struct pmu *pmu)
 {
-	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpc(pmu);
 
 	if (!cpc->sched_cb_usage++)
 		list_add(&cpc->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
@@ -3810,7 +3813,7 @@ static void __link_epc(struct perf_event
 	if (!pmu_ctx->ctx->task)
 		return;
 
-	cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+	cpc = this_cpc(pmu_ctx->pmu);
 	WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
 	cpc->task_epc = pmu_ctx;
 }
@@ -3939,10 +3942,9 @@ static int merge_sched_in(struct perf_ev
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 		} else {
-			struct perf_cpu_pmu_context *cpc;
+			struct perf_cpu_pmu_context *cpc = this_cpc(event->pmu_ctx->pmu);
 
 			event->pmu_ctx->rotate_necessary = 1;
-			cpc = this_cpu_ptr(event->pmu_ctx->pmu->cpu_pmu_context);
 			perf_mux_hrtimer_restart(cpc);
 			group_update_userpage(event);
 		}



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 16/24] perf: Detach perf_cpu_pmu_context and pmu lifetimes
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (14 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 15/24] perf: Add this_cpc() helper Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 17/24] perf: Introduce perf_free_addr_filters() Peter Zijlstra
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

In preparation for being able to unregister a pmu with existing events,
it becomes important to detach the struct perf_cpu_pmu_context lifetime
from that of struct pmu.

Notably, perf_cpu_pmu_context embeds a perf_event_pmu_context that can
stay referenced until the last event goes away.
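
The lifetime rule being introduced -- one extra reference held by the
pmu, with the context surviving unregister until the last event drops
its reference -- can be sketched in standalone C (hypothetical names,
with a flag standing in for the call_rcu() free):

```c
#include <assert.h>
#include <stdlib.h>

struct fake_cpc {
	int refcount;
	int freed;	/* demonstration stand-in for the RCU free */
};

/* Mirrors the atomic_set(&epc->refcount, 2) in the patch:
 * one reference for the event, one extra for the pmu. */
static struct fake_cpc *fake_get_context(void)
{
	struct fake_cpc *cpc = calloc(1, sizeof(*cpc));

	if (cpc)
		cpc->refcount = 2;
	return cpc;
}

static void fake_put(struct fake_cpc *cpc)
{
	if (--cpc->refcount == 0)
		cpc->freed = 1;	/* stands in for call_rcu(..., free_cpc_rcu) */
}
```

Whichever side goes last -- the pmu via perf_pmu_free(), or the final
event via put_pmu_ctx() -- ends up doing the actual free, so neither
order can use-after-free the embedded context.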

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |    4 +--
 kernel/events/core.c       |   56 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 49 insertions(+), 11 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -336,7 +336,7 @@ struct pmu {
 	 */
 	unsigned int			scope;
 
-	struct perf_cpu_pmu_context __percpu *cpu_pmu_context;
+	struct perf_cpu_pmu_context __percpu **cpu_pmu_context;
 	atomic_t			exclusive_cnt; /* < 0: cpu; > 0: tsk */
 	int				task_ctx_nr;
 	int				hrtimer_interval_ms;
@@ -901,7 +901,7 @@ struct perf_event_pmu_context {
 	struct list_head		pinned_active;
 	struct list_head		flexible_active;
 
-	/* Used to avoid freeing per-cpu perf_event_pmu_context */
+	/* Used to identify the per-cpu perf_event_pmu_context */
 	unsigned int			embedded : 1;
 
 	unsigned int			nr_events;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1178,7 +1178,7 @@ static int perf_mux_hrtimer_restart_ipi(
 
 static __always_inline struct perf_cpu_pmu_context *this_cpc(struct pmu *pmu)
 {
-	return this_cpu_ptr(pmu->cpu_pmu_context);
+	return *this_cpu_ptr(pmu->cpu_pmu_context);
 }
 
 void perf_pmu_disable(struct pmu *pmu)
@@ -4971,11 +4971,14 @@ find_get_pmu_context(struct pmu *pmu, st
 		 */
 		struct perf_cpu_pmu_context *cpc;
 
-		cpc = per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
+		cpc = *per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
 		epc = &cpc->epc;
 		raw_spin_lock_irq(&ctx->lock);
 		if (!epc->ctx) {
-			atomic_set(&epc->refcount, 1);
+			/*
+			 * One extra reference for the pmu; see perf_pmu_free().
+			 */
+			atomic_set(&epc->refcount, 2);
 			epc->embedded = 1;
 			list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
 			epc->ctx = ctx;
@@ -5044,6 +5047,15 @@ static void get_pmu_ctx(struct perf_even
 	WARN_ON_ONCE(!atomic_inc_not_zero(&epc->refcount));
 }
 
+static void free_cpc_rcu(struct rcu_head *head)
+{
+	struct perf_cpu_pmu_context *cpc =
+		container_of(head, typeof(*cpc), epc.rcu_head);
+
+	kfree(cpc->epc.task_ctx_data);
+	kfree(cpc);
+}
+
 static void free_epc_rcu(struct rcu_head *head)
 {
 	struct perf_event_pmu_context *epc = container_of(head, typeof(*epc), rcu_head);
@@ -5078,8 +5090,10 @@ static void put_pmu_ctx(struct perf_even
 
 	raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
-	if (epc->embedded)
+	if (epc->embedded) {
+		call_rcu(&epc->rcu_head, free_cpc_rcu);
 		return;
+	}
 
 	call_rcu(&epc->rcu_head, free_epc_rcu);
 }
@@ -11595,7 +11609,7 @@ perf_event_mux_interval_ms_store(struct
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
 		struct perf_cpu_pmu_context *cpc;
-		cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+		cpc = *per_cpu_ptr(pmu->cpu_pmu_context, cpu);
 		cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
 
 		cpu_function_call(cpu, perf_mux_hrtimer_restart_ipi, cpc);
@@ -11767,7 +11781,25 @@ static void perf_pmu_free(struct pmu *pm
 		device_del(pmu->dev);
 		put_device(pmu->dev);
 	}
-	free_percpu(pmu->cpu_pmu_context);
+
+	if (pmu->cpu_pmu_context) {
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			struct perf_cpu_pmu_context *cpc;
+
+			cpc = *per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+			if (!cpc)
+				continue;
+			if (cpc->epc.embedded) {
+				/* refcount managed */
+				put_pmu_ctx(&cpc->epc);
+				continue;
+			}
+			kfree(cpc);
+		}
+		free_percpu(pmu->cpu_pmu_context);
+	}
 }
 
 DEFINE_FREE(pmu_unregister, struct pmu *, if (_T) perf_pmu_free(_T))
@@ -11806,14 +11838,20 @@ int perf_pmu_register(struct pmu *_pmu,
 			return ret;
 	}
 
-	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
+	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context *);
 	if (!pmu->cpu_pmu_context)
 		return -ENOMEM;
 
 	for_each_possible_cpu(cpu) {
-		struct perf_cpu_pmu_context *cpc;
+		struct perf_cpu_pmu_context *cpc =
+			kmalloc_node(sizeof(struct perf_cpu_pmu_context),
+				     GFP_KERNEL | __GFP_ZERO,
+				     cpu_to_node(cpu));
+
+		if (!cpc)
+			return -ENOMEM;
 
-		cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+		*per_cpu_ptr(pmu->cpu_pmu_context, cpu) = cpc;
 		__perf_init_event_pmu_context(&cpc->epc, pmu);
 		__perf_mux_hrtimer_init(cpc, cpu);
 	}



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 17/24] perf: Introduce perf_free_addr_filters()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (15 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 16/24] perf: Detach perf_cpu_pmu_context and pmu lifetimes Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 18/24] perf: Robustify perf_event_free_bpf_prog() Peter Zijlstra
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Replace _free_event()'s use of perf_addr_filters_splice() with an
explicit perf_free_addr_filters() that has the explicit property of
being safe to call a second time without ill effect.

Most notably, referencing event->pmu must be avoided when there are no
filters left (e.g. from a previous call).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5307,8 +5307,7 @@ static bool exclusive_event_installable(
 	return true;
 }
 
-static void perf_addr_filters_splice(struct perf_event *event,
-				       struct list_head *head);
+static void perf_free_addr_filters(struct perf_event *event);
 
 static void perf_pending_task_sync(struct perf_event *event)
 {
@@ -5407,7 +5406,7 @@ static void _free_event(struct perf_even
 	}
 
 	perf_event_free_bpf_prog(event);
-	perf_addr_filters_splice(event, NULL);
+	perf_free_addr_filters(event);
 
 	__free_event(event);
 }
@@ -10880,6 +10882,17 @@ static void perf_addr_filters_splice(str
 	free_filters_list(&list);
 }
 
+static void perf_free_addr_filters(struct perf_event *event)
+{
+	/*
+	 * Used during free paths, there is no concurrency.
+	 */
+	if (list_empty(&event->addr_filters.list))
+		return;
+
+	perf_addr_filters_splice(event, NULL);
+}
+
 /*
  * Scan through mm's vmas and see if one of them matches the
  * @filter; if so, adjust filter's address range.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 18/24] perf: Robustify perf_event_free_bpf_prog()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (16 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 17/24] perf: Introduce perf_free_addr_filters() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 19/24] perf: Simplify perf_mmap() control flow Peter Zijlstra
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Ensure perf_event_free_bpf_prog() is safe to call a second time;
notably without making any references to event->pmu when there is no
prog left.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10782,6 +10781,9 @@ int perf_event_set_bpf_prog(struct perf_
 
 void perf_event_free_bpf_prog(struct perf_event *event)
 {
+	if (!event->prog)
+		return;
+
 	if (!perf_event_is_tracing(event)) {
 		perf_event_free_bpf_handler(event);
 		return;




* [PATCH v2 19/24] perf: Simplify perf_mmap() control flow
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (17 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 18/24] perf: Robustify perf_event_free_bpf_prog() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-03-03  5:39   ` Ravi Bangoria
  2025-02-05 10:21 ` [PATCH v2 20/24] perf: Fix perf_mmap() failure path Peter Zijlstra
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Apply the identity transform, turning:

  if (c) {
    X1;
  } else {
    Y;
    goto l;
  }

  X2;
l:

into:

  if (c) {
    X1;
    X2;
  } else {
    Y;
  }

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   73 ++++++++++++++++++++++++---------------------------
 1 file changed, 35 insertions(+), 38 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6664,6 +6664,41 @@ static int perf_mmap(struct file *file,
 
 	if (vma->vm_pgoff == 0) {
 		nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+		/*
+		 * If we have rb pages ensure they're a power-of-two number, so we
+		 * can do bitmasks instead of modulo.
+		 */
+		if (nr_pages != 0 && !is_power_of_2(nr_pages))
+			return -EINVAL;
+
+		if (vma_size != PAGE_SIZE * (1 + nr_pages))
+			return -EINVAL;
+
+		WARN_ON_ONCE(event->ctx->parent_ctx);
+again:
+		mutex_lock(&event->mmap_mutex);
+		if (event->rb) {
+			if (data_page_nr(event->rb) != nr_pages) {
+				ret = -EINVAL;
+				goto unlock;
+			}
+
+			if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
+				/*
+				 * Raced against perf_mmap_close(); remove the
+				 * event and try again.
+				 */
+				ring_buffer_attach(event, NULL);
+				mutex_unlock(&event->mmap_mutex);
+				goto again;
+			}
+
+			rb = event->rb;
+			goto unlock;
+		}
+
+		user_extra = nr_pages + 1;
 	} else {
 		/*
 		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
@@ -6723,47 +6758,9 @@ static int perf_mmap(struct file *file,
 
 		atomic_set(&rb->aux_mmap_count, 1);
 		user_extra = nr_pages;
-
-		goto accounting;
 	}
 
-	/*
-	 * If we have rb pages ensure they're a power-of-two number, so we
-	 * can do bitmasks instead of modulo.
-	 */
-	if (nr_pages != 0 && !is_power_of_2(nr_pages))
-		return -EINVAL;
-
-	if (vma_size != PAGE_SIZE * (1 + nr_pages))
-		return -EINVAL;
-
-	WARN_ON_ONCE(event->ctx->parent_ctx);
-again:
-	mutex_lock(&event->mmap_mutex);
-	if (event->rb) {
-		if (data_page_nr(event->rb) != nr_pages) {
-			ret = -EINVAL;
-			goto unlock;
-		}
-
-		if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
-			/*
-			 * Raced against perf_mmap_close(); remove the
-			 * event and try again.
-			 */
-			ring_buffer_attach(event, NULL);
-			mutex_unlock(&event->mmap_mutex);
-			goto again;
-		}
-
 		/* We need the rb to map pages. */
-		rb = event->rb;
-		goto unlock;
-	}
-
-	user_extra = nr_pages + 1;
-
-accounting:
 	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
 
 	/*




* [PATCH v2 20/24] perf: Fix perf_mmap() failure path
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (18 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 19/24] perf: Simplify perf_mmap() control flow Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 21/24] perf: Further simplify perf_mmap() Peter Zijlstra
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

When f_ops->mmap() returns failure, m_ops->close() is *not* called;
therefore the ->event_mapped() callback must only be invoked once the
mmap() has actually succeeded, lest the mapped/unmapped callbacks
become unbalanced.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6852,7 +6852,7 @@ static int perf_mmap(struct file *file,
 	if (!ret)
 		ret = map_range(rb, vma);
 
-	if (event->pmu->event_mapped)
+	if (!ret && event->pmu->event_mapped)
 		event->pmu->event_mapped(event, vma->vm_mm);
 
 	return ret;




* [PATCH v2 21/24] perf: Further simplify perf_mmap()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (19 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 20/24] perf: Fix perf_mmap() failure path Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 22/24] perf: Remove retry loop from perf_mmap() Peter Zijlstra
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Hoist the common vma_size/nr_pages validation and the user_extra
computation out of the two mapping branches.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6661,9 +6661,18 @@ static int perf_mmap(struct file *file,
 		return ret;
 
 	vma_size = vma->vm_end - vma->vm_start;
+	nr_pages = vma_size / PAGE_SIZE;
+
+	if (nr_pages > INT_MAX)
+		return -ENOMEM;
+
+	if (vma_size != PAGE_SIZE * nr_pages)
+		return -EINVAL;
+
+	user_extra = nr_pages;
 
 	if (vma->vm_pgoff == 0) {
-		nr_pages = (vma_size / PAGE_SIZE) - 1;
+		nr_pages -= 1;
 
 		/*
 		 * If we have rb pages ensure they're a power-of-two number, so we
@@ -6672,9 +6681,6 @@ static int perf_mmap(struct file *file,
 		if (nr_pages != 0 && !is_power_of_2(nr_pages))
 			return -EINVAL;
 
-		if (vma_size != PAGE_SIZE * (1 + nr_pages))
-			return -EINVAL;
-
 		WARN_ON_ONCE(event->ctx->parent_ctx);
 again:
 		mutex_lock(&event->mmap_mutex);
@@ -6697,8 +6703,6 @@ static int perf_mmap(struct file *file,
 			rb = event->rb;
 			goto unlock;
 		}
-
-		user_extra = nr_pages + 1;
 	} else {
 		/*
 		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
@@ -6710,10 +6714,6 @@ static int perf_mmap(struct file *file,
 		if (!event->rb)
 			return -EINVAL;
 
-		nr_pages = vma_size / PAGE_SIZE;
-		if (nr_pages > INT_MAX)
-			return -ENOMEM;
-
 		mutex_lock(&event->mmap_mutex);
 		ret = -EINVAL;
 
@@ -6757,7 +6757,6 @@ static int perf_mmap(struct file *file,
 		}
 
 		atomic_set(&rb->aux_mmap_count, 1);
-		user_extra = nr_pages;
 	}
 
 		/* We need the rb to map pages. */




* [PATCH v2 22/24] perf: Remove retry loop from perf_mmap()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (20 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 21/24] perf: Further simplify perf_mmap() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 23/24] perf: Lift event->mmap_mutex in perf_mmap() Peter Zijlstra
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

AFAICT there is no actual benefit from the mutex drop on re-try. The
'worst' case scenario is that we instantly re-gain the mutex without
perf_mmap_close() getting it. So might as well make that the normal
case.

Reflow the code to make the ring buffer detach case naturally flow
into the no ring buffer case.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6667,27 +6667,32 @@ static int perf_mmap(struct file *file,
 			return -EINVAL;
 
 		WARN_ON_ONCE(event->ctx->parent_ctx);
-again:
 		mutex_lock(&event->mmap_mutex);
+
 		if (event->rb) {
 			if (data_page_nr(event->rb) != nr_pages) {
 				ret = -EINVAL;
 				goto unlock;
 			}
 
-			if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
+			if (atomic_inc_not_zero(&event->rb->mmap_count)) {
 				/*
-				 * Raced against perf_mmap_close(); remove the
-				 * event and try again.
+				 * Success -- managed to mmap() the same buffer
+				 * multiple times.
 				 */
-				ring_buffer_attach(event, NULL);
-				mutex_unlock(&event->mmap_mutex);
-				goto again;
+				rb = event->rb;
+				ret = 0;
+				goto unlock;
 			}
 
-			rb = event->rb;
-			goto unlock;
+			/*
+			 * Raced against perf_mmap_close()'s
+			 * atomic_dec_and_mutex_lock() remove the
+			 * event and continue as if !event->rb
+			 */
+			ring_buffer_attach(event, NULL);
 		}
+
 	} else {
 		/*
 		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already




* [PATCH v2 23/24] perf: Lift event->mmap_mutex in perf_mmap()
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (21 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 22/24] perf: Remove retry loop from perf_mmap() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-05 10:21 ` [PATCH v2 24/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
  2025-03-03  6:01 ` [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Ravi Bangoria
  24 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

This puts 'all' of perf_mmap() under a single event->mmap_mutex
critical section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6643,7 +6643,7 @@ static int perf_mmap(struct file *file,
 	unsigned long vma_size;
 	unsigned long nr_pages;
 	long user_extra = 0, extra = 0;
-	int ret = 0, flags = 0;
+	int ret, flags = 0;
 
 	/*
 	 * Don't allow mmap() of inherited per-task counters. This would
@@ -6671,6 +6671,9 @@ static int perf_mmap(struct file *file,
 
 	user_extra = nr_pages;
 
+	mutex_lock(&event->mmap_mutex);
+	ret = -EINVAL;
+
 	if (vma->vm_pgoff == 0) {
 		nr_pages -= 1;
 
@@ -6679,16 +6682,13 @@ static int perf_mmap(struct file *file,
 		 * can do bitmasks instead of modulo.
 		 */
 		if (nr_pages != 0 && !is_power_of_2(nr_pages))
-			return -EINVAL;
+			goto unlock;
 
 		WARN_ON_ONCE(event->ctx->parent_ctx);
-		mutex_lock(&event->mmap_mutex);
 
 		if (event->rb) {
-			if (data_page_nr(event->rb) != nr_pages) {
-				ret = -EINVAL;
+			if (data_page_nr(event->rb) != nr_pages)
 				goto unlock;
-			}
 
 			if (atomic_inc_not_zero(&event->rb->mmap_count)) {
 				/*
@@ -6716,12 +6716,6 @@ static int perf_mmap(struct file *file,
 		 */
 		u64 aux_offset, aux_size;
 
-		if (!event->rb)
-			return -EINVAL;
-
-		mutex_lock(&event->mmap_mutex);
-		ret = -EINVAL;
-
 		rb = event->rb;
 		if (!rb)
 			goto aux_unlock;
@@ -6832,6 +6826,8 @@ static int perf_mmap(struct file *file,
 			rb->aux_mmap_locked = extra;
 	}
 
+	ret = 0;
+
 unlock:
 	if (!ret) {
 		atomic_long_add(user_extra, &user->locked_vm);




* [PATCH v2 24/24] perf: Make perf_pmu_unregister() usable
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (22 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 23/24] perf: Lift event->mmap_mutex in perf_mmap() Peter Zijlstra
@ 2025-02-05 10:21 ` Peter Zijlstra
  2025-02-10  6:39   ` Ravi Bangoria
                     ` (2 more replies)
  2025-03-03  6:01 ` [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Ravi Bangoria
  24 siblings, 3 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-05 10:21 UTC (permalink / raw)
  To: mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, peterz, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

Previously it was only safe to call perf_pmu_unregister() if there
were no active events of that pmu around -- which was impossible to
guarantee since it races in all sorts of ways against perf_init_event().

Rework the whole thing by:

 - keeping track of all events for a given pmu

 - 'hiding' the pmu from perf_init_event()

 - waiting for the appropriate (s)rcu grace periods such that all
   prior references to the PMU will be completed

 - detaching all still existing events of that pmu (see first point)
   and moving them to a new REVOKED state.

 - actually freeing the pmu data.

Where notably the new REVOKED state must inhibit all event actions
from reaching code that wants to use event->pmu.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |   15 +-
 kernel/events/core.c       |  266 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 249 insertions(+), 32 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -325,6 +325,9 @@ struct perf_output_handle;
 struct pmu {
 	struct list_head		entry;
 
+	spinlock_t			events_lock;
+	struct list_head		events;
+
 	struct module			*module;
 	struct device			*dev;
 	struct device			*parent;
@@ -632,9 +635,10 @@ struct perf_addr_filter_range {
  * enum perf_event_state - the states of an event:
  */
 enum perf_event_state {
-	PERF_EVENT_STATE_DEAD		= -4,
-	PERF_EVENT_STATE_EXIT		= -3,
-	PERF_EVENT_STATE_ERROR		= -2,
+	PERF_EVENT_STATE_DEAD		= -5,
+	PERF_EVENT_STATE_REVOKED	= -4, /* pmu gone, must not touch */
+	PERF_EVENT_STATE_EXIT		= -3, /* task died, still inherit */
+	PERF_EVENT_STATE_ERROR		= -2, /* scheduling error, can enable */
 	PERF_EVENT_STATE_OFF		= -1,
 	PERF_EVENT_STATE_INACTIVE	=  0,
 	PERF_EVENT_STATE_ACTIVE		=  1,
@@ -875,6 +879,7 @@ struct perf_event {
 	void *security;
 #endif
 	struct list_head		sb_list;
+	struct list_head		pmu_list;
 
 	/*
 	 * Certain events gets forwarded to another pmu internally by over-
@@ -1126,7 +1131,7 @@ extern void perf_aux_output_flag(struct
 extern void perf_event_itrace_started(struct perf_event *event);
 
 extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
-extern void perf_pmu_unregister(struct pmu *pmu);
+extern int perf_pmu_unregister(struct pmu *pmu);
 
 extern void __perf_event_task_sched_in(struct task_struct *prev,
 				       struct task_struct *task);
@@ -1737,7 +1742,7 @@ static inline bool needs_branch_stack(st
 
 static inline bool has_aux(struct perf_event *event)
 {
-	return event->pmu->setup_aux;
+	return event->pmu && event->pmu->setup_aux;
 }
 
 static inline bool has_aux_action(struct perf_event *event)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2418,7 +2418,9 @@ ctx_time_update_event(struct perf_event_
 
 #define DETACH_GROUP	0x01UL
 #define DETACH_CHILD	0x02UL
-#define DETACH_DEAD	0x04UL
+#define DETACH_EXIT	0x04UL
+#define DETACH_REVOKE	0x08UL
+#define DETACH_DEAD	0x10UL
 
 /*
  * Cross CPU call to remove a performance event
@@ -2433,6 +2435,7 @@ __perf_remove_from_context(struct perf_e
 			   void *info)
 {
 	struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
+	enum perf_event_state state = PERF_EVENT_STATE_OFF;
 	unsigned long flags = (unsigned long)info;
 
 	ctx_time_update(cpuctx, ctx);
@@ -2441,16 +2444,22 @@ __perf_remove_from_context(struct perf_e
 	 * Ensure event_sched_out() switches to OFF, at the very least
 	 * this avoids raising perf_pending_task() at this time.
 	 */
-	if (flags & DETACH_DEAD)
+	if (flags & DETACH_EXIT)
+		state = PERF_EVENT_STATE_EXIT;
+	if (flags & DETACH_REVOKE)
+		state = PERF_EVENT_STATE_REVOKED;
+	if (flags & DETACH_DEAD) {
 		event->pending_disable = 1;
+		state = PERF_EVENT_STATE_DEAD;
+	}
 	event_sched_out(event, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	if (flags & DETACH_CHILD)
 		perf_child_detach(event);
 	list_del_event(event, ctx);
-	if (flags & DETACH_DEAD)
-		event->state = PERF_EVENT_STATE_DEAD;
+
+	event->state = min(event->state, state);
 
 	if (!pmu_ctx->nr_events) {
 		pmu_ctx->rotate_necessary = 0;
@@ -4510,7 +4519,8 @@ static void perf_event_enable_on_exec(st
 
 static void perf_remove_from_owner(struct perf_event *event);
 static void perf_event_exit_event(struct perf_event *event,
-				  struct perf_event_context *ctx);
+				  struct perf_event_context *ctx,
+				  bool revoke);
 
 /*
  * Removes all events from the current task that have been marked
@@ -4537,7 +4547,7 @@ static void perf_event_remove_on_exec(st
 
 		modified = true;
 
-		perf_event_exit_event(event, ctx);
+		perf_event_exit_event(event, ctx, false);
 	}
 
 	raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -5137,6 +5147,7 @@ static bool is_sb_event(struct perf_even
 	    attr->context_switch || attr->text_poke ||
 	    attr->bpf_event)
 		return true;
+
 	return false;
 }
 
@@ -5338,6 +5349,8 @@ static void perf_pending_task_sync(struc
 /* vs perf_event_alloc() error */
 static void __free_event(struct perf_event *event)
 {
+	struct pmu *pmu = event->pmu;
+
 	if (event->attach_state & PERF_ATTACH_CALLCHAIN)
 		put_callchain_buffers();
 
@@ -5364,6 +5377,7 @@ static void __free_event(struct perf_eve
 		 * put_pmu_ctx() needs an event->ctx reference, because of
 		 * epc->ctx.
 		 */
+		WARN_ON_ONCE(!pmu);
 		WARN_ON_ONCE(!event->ctx);
 		WARN_ON_ONCE(event->pmu_ctx->ctx != event->ctx);
 		put_pmu_ctx(event->pmu_ctx);
@@ -5376,8 +5390,13 @@ static void __free_event(struct perf_eve
 	if (event->ctx)
 		put_ctx(event->ctx);
 
-	if (event->pmu)
-		module_put(event->pmu->module);
+	if (pmu) {
+		module_put(pmu->module);
+		scoped_guard (spinlock, &pmu->events_lock) {
+			list_del(&event->pmu_list);
+			wake_up_var(pmu);
+		}
+	}
 
 	call_rcu(&event->rcu_head, free_event_rcu);
 }
@@ -5525,7 +5544,11 @@ int perf_event_release_kernel(struct per
 	 * Thus this guarantees that we will in fact observe and kill _ALL_
 	 * child events.
 	 */
-	perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+	if (event->state > PERF_EVENT_STATE_REVOKED) {
+		perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+	} else {
+		event->state = PERF_EVENT_STATE_DEAD;
+	}
 
 	perf_event_ctx_unlock(event, ctx);
 
@@ -5578,7 +5601,7 @@ int perf_event_release_kernel(struct per
 		mutex_unlock(&ctx->mutex);
 
 		if (child)
-			free_event(child);
+			put_event(child);
 		put_ctx(ctx);
 
 		goto again;
@@ -5813,7 +5836,7 @@ __perf_read(struct perf_event *event, ch
 	 * error state (i.e. because it was pinned but it couldn't be
 	 * scheduled on to the CPU at some point).
 	 */
-	if (event->state == PERF_EVENT_STATE_ERROR)
+	if (event->state <= PERF_EVENT_STATE_ERROR)
 		return 0;
 
 	if (count < event->read_size)
@@ -5852,8 +5875,14 @@ static __poll_t perf_poll(struct file *f
 	struct perf_buffer *rb;
 	__poll_t events = EPOLLHUP;
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return EPOLLERR;
+
 	poll_wait(file, &event->waitq, wait);
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return EPOLLERR;
+
 	if (is_event_hup(event))
 		return events;
 
@@ -6027,6 +6056,9 @@ static long _perf_ioctl(struct perf_even
 	void (*func)(struct perf_event *);
 	u32 flags = arg;
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return -ENODEV;
+
 	switch (cmd) {
 	case PERF_EVENT_IOC_ENABLE:
 		func = _perf_event_enable;
@@ -6412,7 +6444,7 @@ static void perf_mmap_open(struct vm_are
 	if (vma->vm_pgoff)
 		atomic_inc(&event->rb->aux_mmap_count);
 
-	if (event->pmu->event_mapped)
+	if (event->pmu && event->pmu->event_mapped)
 		event->pmu->event_mapped(event, vma->vm_mm);
 }
 
@@ -6435,7 +6467,8 @@ static void perf_mmap_close(struct vm_ar
 	unsigned long size = perf_data_size(rb);
 	bool detach_rest = false;
 
-	if (event->pmu->event_unmapped)
+	/* FIXIES vs perf_pmu_unregister() */
+	if (event->pmu && event->pmu->event_unmapped)
 		event->pmu->event_unmapped(event, vma->vm_mm);
 
 	/*
@@ -6659,6 +6692,16 @@ static int perf_mmap(struct file *file,
 	mutex_lock(&event->mmap_mutex);
 	ret = -EINVAL;
 
+	/*
+	 * This relies on __pmu_detach_event() taking mmap_mutex after marking
+	 * the event REVOKED. Either we observe the state, or __pmu_detach_event()
+	 * will detach the rb created here.
+	 */
+	if (event->state <= PERF_EVENT_STATE_REVOKED) {
+		ret = -ENODEV;
+		goto unlock;
+	}
+
 	if (vma->vm_pgoff == 0) {
 		nr_pages -= 1;
 
@@ -6837,7 +6880,7 @@ static int perf_mmap(struct file *file,
 	if (!ret)
 		ret = map_range(rb, vma);
 
-	if (!ret && event->pmu->event_mapped)
+	if (!ret && event->pmu && event->pmu->event_mapped)
 		event->pmu->event_mapped(event, vma->vm_mm);
 
 	return ret;
@@ -6849,6 +6892,9 @@ static int perf_fasync(int fd, struct fi
 	struct perf_event *event = filp->private_data;
 	int retval;
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return -ENODEV;
+
 	inode_lock(inode);
 	retval = fasync_helper(fd, filp, on, &event->fasync);
 	inode_unlock(inode);
@@ -10813,6 +10859,9 @@ static int __perf_event_set_bpf_prog(str
 {
 	bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return -ENODEV;
+
 	if (!perf_event_is_tracing(event))
 		return perf_event_set_bpf_handler(event, prog, bpf_cookie);
 
@@ -11997,6 +12046,9 @@ int perf_pmu_register(struct pmu *_pmu,
 	if (!pmu->event_idx)
 		pmu->event_idx = perf_event_idx_default;
 
+	INIT_LIST_HEAD(&pmu->events);
+	spin_lock_init(&pmu->events_lock);
+
 	/*
 	 * Now that the PMU is complete, make it visible to perf_try_init_event().
 	 */
@@ -12010,21 +12062,143 @@ int perf_pmu_register(struct pmu *_pmu,
 }
 EXPORT_SYMBOL_GPL(perf_pmu_register);
 
-void perf_pmu_unregister(struct pmu *pmu)
+static void __pmu_detach_event(struct pmu *pmu, struct perf_event *event,
+			       struct perf_event_context *ctx)
+{
+	/*
+	 * De-schedule the event and mark it REVOKED.
+	 */
+	perf_event_exit_event(event, ctx, true);
+
+	/*
+	 * All _free_event() bits that rely on event->pmu:
+	 *
+	 * Notably, perf_mmap() relies on the ordering here.
+	 */
+	scoped_guard (mutex, &event->mmap_mutex) {
+		WARN_ON_ONCE(pmu->event_unmapped);
+		/*
+		 * Mostly an empty lock sequence, such that perf_mmap(), which
+		 * relies on mmap_mutex, is sure to observe the state change.
+		 */
+	}
+
+	perf_event_free_bpf_prog(event);
+	perf_free_addr_filters(event);
+
+	if (event->destroy) {
+		event->destroy(event);
+		event->destroy = NULL;
+	}
+
+	if (event->pmu_ctx) {
+		put_pmu_ctx(event->pmu_ctx);
+		event->pmu_ctx = NULL;
+	}
+
+	exclusive_event_destroy(event);
+	module_put(pmu->module);
+
+	event->pmu = NULL; /* force fault instead of UAF */
+}
+
+static void pmu_detach_event(struct pmu *pmu, struct perf_event *event)
+{
+	struct perf_event_context *ctx;
+
+	ctx = perf_event_ctx_lock(event);
+	__pmu_detach_event(pmu, event, ctx);
+	perf_event_ctx_unlock(event, ctx);
+
+	scoped_guard (spinlock, &pmu->events_lock)
+		list_del(&event->pmu_list);
+}
+
+static struct perf_event *pmu_get_event(struct pmu *pmu)
+{
+	struct perf_event *event;
+
+	guard(spinlock)(&pmu->events_lock);
+	list_for_each_entry(event, &pmu->events, pmu_list) {
+		if (atomic_long_inc_not_zero(&event->refcount))
+			return event;
+	}
+
+	return NULL;
+}
+
+static bool pmu_empty(struct pmu *pmu)
+{
+	guard(spinlock)(&pmu->events_lock);
+	return list_empty(&pmu->events);
+}
+
+static void pmu_detach_events(struct pmu *pmu)
+{
+	struct perf_event *event;
+
+	for (;;) {
+		event = pmu_get_event(pmu);
+		if (!event)
+			break;
+
+		pmu_detach_event(pmu, event);
+		put_event(event);
+	}
+
+	/*
+	 * wait for pending _free_event()s
+	 */
+	wait_var_event(pmu, pmu_empty(pmu));
+}
+
+int perf_pmu_unregister(struct pmu *pmu)
 {
 	scoped_guard (mutex, &pmus_lock) {
+		if (!idr_cmpxchg(&pmu_idr, pmu->type, pmu, NULL))
+			return -EINVAL;
+
 		list_del_rcu(&pmu->entry);
-		idr_remove(&pmu_idr, pmu->type);
 	}
 
 	/*
 	 * We dereference the pmu list under both SRCU and regular RCU, so
 	 * synchronize against both of those.
+	 *
+	 * Notably, the entirety of event creation, from perf_init_event()
+	 * (which will now fail, because of the above) until
+	 * perf_install_in_context() should be under SRCU such that
+	 * this synchronizes against event creation. This avoids trying to
+	 * detach events that are not fully formed.
 	 */
 	synchronize_srcu(&pmus_srcu);
 	synchronize_rcu();
 
+	if (pmu->event_unmapped && !pmu_empty(pmu)) {
+		/*
+		 * Can't force remove events when pmu::event_unmapped()
+		 * is used in perf_mmap_close().
+		 */
+		guard(mutex)(&pmus_lock);
+		idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu);
+		list_add_rcu(&pmu->entry, &pmus);
+		return -EBUSY;
+	}
+
+	scoped_guard (mutex, &pmus_lock)
+		idr_remove(&pmu_idr, pmu->type);
+
+	/*
+	 * PMU is removed from the pmus list, so no new events will
+	 * be created, now take care of the existing ones.
+	 */
+	pmu_detach_events(pmu);
+
+	/*
+	 * PMU is unused, make it go away.
+	 */
 	perf_pmu_free(pmu);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
@@ -12107,7 +12281,7 @@ static struct pmu *perf_init_event(struc
 	struct pmu *pmu;
 	int type, ret;
 
-	guard(srcu)(&pmus_srcu);
+	guard(srcu)(&pmus_srcu); /* pmu idr/list access */
 
 	/*
 	 * Save original type before calling pmu->event_init() since certain
@@ -12331,6 +12505,7 @@ perf_event_alloc(struct perf_event_attr
 	INIT_LIST_HEAD(&event->active_entry);
 	INIT_LIST_HEAD(&event->addr_filters.list);
 	INIT_HLIST_NODE(&event->hlist_entry);
+	INIT_LIST_HEAD(&event->pmu_list);
 
 
 	init_waitqueue_head(&event->waitq);
@@ -12498,6 +12673,13 @@ perf_event_alloc(struct perf_event_attr
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
+	/*
+	 * Event creation should be under SRCU, see perf_pmu_unregister().
+	 */
+	lockdep_assert_held(&pmus_srcu);
+	scoped_guard (spinlock, &pmu->events_lock)
+		list_add(&event->pmu_list, &pmu->events);
+
 	return_ptr(event);
 }
 
@@ -12697,6 +12879,9 @@ perf_event_set_output(struct perf_event
 		goto unlock;
 
 	if (output_event) {
+		if (output_event->state <= PERF_EVENT_STATE_REVOKED)
+			goto unlock;
+
 		/* get the rb we want to redirect to */
 		rb = ring_buffer_get(output_event);
 		if (!rb)
@@ -12878,6 +13063,11 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (event_fd < 0)
 		return event_fd;
 
+	/*
+	 * Event creation should be under SRCU, see perf_pmu_unregister().
+	 */
+	guard(srcu)(&pmus_srcu);
+
 	CLASS(fd, group)(group_fd);     // group_fd == -1 => empty
 	if (group_fd != -1) {
 		if (!is_perf_file(group)) {
@@ -12885,6 +13075,10 @@ SYSCALL_DEFINE5(perf_event_open,
 			goto err_fd;
 		}
 		group_leader = fd_file(group)->private_data;
+		if (group_leader->state <= PERF_EVENT_STATE_REVOKED) {
+			err = -ENODEV;
+			goto err_fd;
+		}
 		if (flags & PERF_FLAG_FD_OUTPUT)
 			output_event = group_leader;
 		if (flags & PERF_FLAG_FD_NO_GROUP)
@@ -13218,6 +13412,11 @@ perf_event_create_kernel_counter(struct
 	if (attr->aux_output || attr->aux_action)
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * Event creation should be under SRCU, see perf_pmu_unregister().
+	 */
+	guard(srcu)(&pmus_srcu);
+
 	event = perf_event_alloc(attr, cpu, task, NULL, NULL,
 				 overflow_handler, context, -1);
 	if (IS_ERR(event)) {
@@ -13429,10 +13628,11 @@ static void sync_child_event(struct perf
 }
 
 static void
-perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
+perf_event_exit_event(struct perf_event *event,
+		      struct perf_event_context *ctx, bool revoke)
 {
 	struct perf_event *parent_event = event->parent;
-	unsigned long detach_flags = 0;
+	unsigned long detach_flags = DETACH_EXIT;
 
 	if (parent_event) {
 		/*
@@ -13447,16 +13647,14 @@ perf_event_exit_event(struct perf_event
 		 * Do destroy all inherited groups, we don't care about those
 		 * and being thorough is better.
 		 */
-		detach_flags = DETACH_GROUP | DETACH_CHILD;
+		detach_flags |= DETACH_GROUP | DETACH_CHILD;
 		mutex_lock(&parent_event->child_mutex);
 	}
 
-	perf_remove_from_context(event, detach_flags);
+	if (revoke)
+		detach_flags |= DETACH_GROUP | DETACH_REVOKE;
 
-	raw_spin_lock_irq(&ctx->lock);
-	if (event->state > PERF_EVENT_STATE_EXIT)
-		perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
-	raw_spin_unlock_irq(&ctx->lock);
+	perf_remove_from_context(event, detach_flags);
 
 	/*
 	 * Child events can be freed.
@@ -13467,7 +13665,13 @@ perf_event_exit_event(struct perf_event
 		 * Kick perf_poll() for is_event_hup();
 		 */
 		perf_event_wakeup(parent_event);
-		free_event(event);
+		/*
+		 * pmu_detach_event() will have an extra refcount.
+		 */
+		if (revoke)
+			put_event(event);
+		else
+			free_event(event);
 		put_event(parent_event);
 		return;
 	}
@@ -13532,7 +13736,7 @@ static void perf_event_exit_task_context
 	perf_event_task(child, child_ctx, 0);
 
 	list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
-		perf_event_exit_event(child_event, child_ctx);
+		perf_event_exit_event(child_event, child_ctx, false);
 
 	mutex_unlock(&child_ctx->mutex);
 
@@ -13722,6 +13926,14 @@ inherit_event(struct perf_event *parent_
 	if (parent_event->parent)
 		parent_event = parent_event->parent;
 
+	if (parent_event->state <= PERF_EVENT_STATE_REVOKED)
+		return NULL;
+
+	/*
+	 * Event creation should be under SRCU, see perf_pmu_unregister().
+	 */
+	guard(srcu)(&pmus_srcu);
+
 	child_event = perf_event_alloc(&parent_event->attr,
 					   parent_event->cpu,
 					   child,




* Re: [PATCH v2 01/24] lockdep: Fix might_fault()
  2025-02-05 10:21 ` [PATCH v2 01/24] lockdep: Fix might_fault() Peter Zijlstra
@ 2025-02-06 18:19   ` David Hildenbrand
  0 siblings, 0 replies; 47+ messages in thread
From: David Hildenbrand @ 2025-02-06 18:19 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, ravi.bangoria, lucas.demarchi
  Cc: linux-kernel, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang

On 05.02.25 11:21, Peter Zijlstra wrote:
> Turns out that commit 9ec23531fd48 ("sched/preempt, mm/fault: Trigger
> might_sleep() in might_fault() with disabled pagefaults") accidentally
> (and unnessecarily) put the lockdep part of __might_fault() under
> CONFIG_DEBUG_ATOMIC_SLEEP.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Fixes: 9ec23531fd48 ("sched/preempt, mm/fault: Trigger might_sleep() in might_fault() with disabled pagefaults")

Heh, that was 10 years ago, part one of my first not-kvm-specific patch 
series.

Very likely it was not by accident, but I don't have any memory on that 
part :)

Probably I actually wanted this to be

  #if defined(CONFIG_PROVE_LOCKING)

such that the function looked more similar to pre-662bbcb2747c


> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   mm/memory.c |    2 --
>   1 file changed, 2 deletions(-)
> 
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6695,10 +6695,8 @@ void __might_fault(const char *file, int
>   	if (pagefault_disabled())
>   		return;
>   	__might_sleep(file, line);
> -#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
>   	if (current->mm)
>   		might_lock_read(&current->mm->mmap_lock);
> -#endif
>   }
>   EXPORT_SYMBOL(__might_fault);
>   #endif
> 
> 


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-05 10:21 ` [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable Peter Zijlstra
@ 2025-02-10  6:39   ` Ravi Bangoria
  2025-02-11 15:46     ` Peter Zijlstra
  2025-02-10  6:42   ` Ravi Bangoria
  2025-02-10  6:59   ` Ravi Bangoria
  2 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-10  6:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@kernel.org, lucas.demarchi@intel.com,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com, Ravi Bangoria

Hi Peter,

> @@ -1737,7 +1742,7 @@ static inline bool needs_branch_stack(st
>  
>  static inline bool has_aux(struct perf_event *event)
>  {
> -	return event->pmu->setup_aux;
> +	return event->pmu && event->pmu->setup_aux;
>  }

this ...

> @@ -6412,7 +6444,7 @@ static void perf_mmap_open(struct vm_are
>  	if (vma->vm_pgoff)
>  		atomic_inc(&event->rb->aux_mmap_count);
>  
> -	if (event->pmu->event_mapped)
> +	if (event->pmu && event->pmu->event_mapped)
>  		event->pmu->event_mapped(event, vma->vm_mm);
>  }
>  
> @@ -6435,7 +6467,8 @@ static void perf_mmap_close(struct vm_ar
>  	unsigned long size = perf_data_size(rb);
>  	bool detach_rest = false;
>  
> -	if (event->pmu->event_unmapped)
> +	/* FIXIES vs perf_pmu_unregister() */
> +	if (event->pmu && event->pmu->event_unmapped)
>  		event->pmu->event_unmapped(event, vma->vm_mm);

these two ...

> @@ -6837,7 +6880,7 @@ static int perf_mmap(struct file *file,
>  	if (!ret)
>  		ret = map_range(rb, vma);
>  
> -	if (!ret && event->pmu->event_mapped)
> +	if (!ret && event->pmu && event->pmu->event_mapped)
>  		event->pmu->event_mapped(event, vma->vm_mm);

... and this one.

There is no serialization at any of these places; event->pmu can become
NULL between the check and the subsequent use:

  CPU0:            event->pmu = NULL
                         |
                         V
  CPU1:   if (event->pmu && event->pmu->blahblah())

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-05 10:21 ` [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable Peter Zijlstra
  2025-02-10  6:39   ` Ravi Bangoria
@ 2025-02-10  6:42   ` Ravi Bangoria
  2025-02-12 12:49     ` Peter Zijlstra
  2025-02-10  6:59   ` Ravi Bangoria
  2 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-10  6:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang, Ravi Bangoria

> @@ -13467,7 +13665,13 @@ perf_event_exit_event(struct perf_event
>  		 * Kick perf_poll() for is_event_hup();
>  		 */
>  		perf_event_wakeup(parent_event);
> -		free_event(event);
> +		/*
> +		 * pmu_detach_event() will have an extra refcount.
> +		 */
> +		if (revoke)
> +			put_event(event);
> +		else
> +			free_event(event);
>  		put_event(parent_event);
>  		return;
>  	}

There is a race between do_exit() and perf_pmu_unregister():

Assume: a Parent process with 'event1' and a Child process with an
inherited 'event2'.

                  CPU 1                                                       CPU 2

      Child process exiting ...
   1  do_exit()
   2    perf_event_exit_task()
   3      perf_event_exit_task_context()
   4        mutex_lock(&child_ctx->mutex);
   5        list_for_each_entry_safe(&child_ctx->event_list)
   6        /* picks event2. event2->refcount is 1 */
   7                                                             perf_pmu_unregister()
   8                                                               pmu_detach_events()
   9                                                                 pmu_get_event(pmu) /* Picks event2 */
  10                                                                   atomic_long_inc_not_zero(&event->refcount) /* event2 */
  11                                                                   /* event2->refcount becomes 2 */
  12                                                                 pmu_detach_event() /* event2 */
  13                                                                   perf_event_ctx_lock(); /* event2 */
  14        /* event2->refcount became 2 (by CPU 2) */                 .
  15      perf_event_exit_event() /* event2 */                         . /* CPU 1 is holding ctx->mutex of
  16        if (parent_event) { /* true */                             .    child process. Waiting ... */
  17          if (revoke) /* false */                                  .
  18          else                                                     .
  19              free_event();  /* event2 */                          .
  20                  WARN()                                           .

                      ^^^^^^

This race manifests as:

  unexpected event refcount: 2; ptr=00000000c6ca2049
  WARNING: CPU: 57 PID: 9729 at kernel/events/core.c:5439 free_event+0x48/0x60
  RIP: 0010:free_event+0x48/0x60
  Call Trace:
   ? free_event+0x48/0x60
   perf_event_exit_event.isra.0+0x60/0xf0
   perf_event_exit_task+0x1e1/0x280
   do_exit+0x327/0xb00

The race continues... now, with the stale child2->parent pointer:

                  CPU 1                                                       CPU 2

  15      perf_event_exit_event() /* event2 */                         . /* CPU 1 is holding ctx->mutex of
  16        if (parent_event) { /* true */                             .    child process. Waiting ... */
  17          if (revoke) /* false */                                  .
  18          else                                                     .
  19              free_event();  /* event2 */                          .
  20                  WARN()                                           .
  21          put_event(parent_event); /* event1 */                    .
  22          /* event1->refcount becomes 1 */                         .
  23        }                                                          .
  24        +- mutex_unlock(&child_ctx->mutex);                        . /* Acquired child's ctx->mutex */
  25                                                                   __pmu_detach_event() /* event2 */
  26                                                                     perf_event_exit_event() /* event2 */
  27                                                                       if (parent_event) { /* true, BUT FALSE POSITIVE. */
  28                                                                         if (revoke) /* true */
  29                                                                             put_event(); /* event2 */
  30                                                                             /* event2->refcount becomes 1 */
  31                                                                         put_event(parent_event); /* event1 */
  32                                                                         /* event1->refcount becomes 0 */

                                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^

This can manifest as some random bug. I guess, perf_event_exit_event()
should also check for (event->attach_state & PERF_ATTACH_CHILD) when
event->parent != NULL.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-05 10:21 ` [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable Peter Zijlstra
  2025-02-10  6:39   ` Ravi Bangoria
  2025-02-10  6:42   ` Ravi Bangoria
@ 2025-02-10  6:59   ` Ravi Bangoria
  2025-02-13 13:07     ` Peter Zijlstra
  2 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-10  6:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang, Ravi Bangoria

On 05-Feb-25 3:51 PM, Peter Zijlstra wrote:
> Previously it was only safe to call perf_pmu_unregister() if there
> were no active events of that pmu around -- which was impossible to
> guarantee since it races all sorts against perf_init_event().
> 
> Rework the whole thing by:
> 
>  - keeping track of all events for a given pmu
> 
>  - 'hiding' the pmu from perf_init_event()
> 
>  - waiting for the appropriate (s)rcu grace periods such that all
>    prior references to the PMU will be completed
> 
>  - detaching all still existing events of that pmu (see first point)
>    and moving them to a new REVOKED state.
> 
>  - actually freeing the pmu data.
> 
> Where notably the new REVOKED state must inhibit all event actions
> from reaching code that wants to use event->pmu.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Another race between perf_event_init_task() failure path and
perf_pmu_unregister():

        CPU 1                                          CPU 2

  perf_event_init_task()
    perf_event_free_task()
      perf_free_event(event1)
      /* event1->refcount is 1 */
                                                  perf_pmu_unregister()
                                                    pmu_detach_events()
                                                      pmu_get_event(pmu) /* Picks event1 */
                                                        atomic_long_inc_not_zero(&event1->refcount) /* event1 */
      /* event1->refcount became 2 (by CPU 2) */
        free_event(event1)
          WARN()

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-10  6:39   ` Ravi Bangoria
@ 2025-02-11 15:46     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-11 15:46 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: mingo@kernel.org, lucas.demarchi@intel.com,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com

On Mon, Feb 10, 2025 at 12:09:18PM +0530, Ravi Bangoria wrote:
> Hi Peter,
> 
> > @@ -1737,7 +1742,7 @@ static inline bool needs_branch_stack(st
> >  
> >  static inline bool has_aux(struct perf_event *event)
> >  {
> > -	return event->pmu->setup_aux;
> > +	return event->pmu && event->pmu->setup_aux;
> >  }
> 
> this ...

Bah, I'll try and trace that.

> > @@ -6412,7 +6444,7 @@ static void perf_mmap_open(struct vm_are
> >  	if (vma->vm_pgoff)
> >  		atomic_inc(&event->rb->aux_mmap_count);
> >  
> > -	if (event->pmu->event_mapped)
> > +	if (event->pmu && event->pmu->event_mapped)
> >  		event->pmu->event_mapped(event, vma->vm_mm);
> >  }
> >  
> > @@ -6435,7 +6467,8 @@ static void perf_mmap_close(struct vm_ar
> >  	unsigned long size = perf_data_size(rb);
> >  	bool detach_rest = false;
> >  
> > -	if (event->pmu->event_unmapped)
> > +	/* FIXIES vs perf_pmu_unregister() */
> > +	if (event->pmu && event->pmu->event_unmapped)
> >  		event->pmu->event_unmapped(event, vma->vm_mm);
> 
> these two ...
> 
> > @@ -6837,7 +6880,7 @@ static int perf_mmap(struct file *file,
> >  	if (!ret)
> >  		ret = map_range(rb, vma);
> >  
> > -	if (!ret && event->pmu->event_mapped)
> > +	if (!ret && event->pmu && event->pmu->event_mapped)
> >  		event->pmu->event_mapped(event, vma->vm_mm);
> 
> ... and this one.

These are relatively simple to fix; something like the below. This relies on
the fact that if there are mapped functions, the whole unregister thing
won't happen.

---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6432,9 +6432,22 @@ void ring_buffer_put(struct perf_buffer
 	call_rcu(&rb->rcu_head, rb_free_rcu);
 }
 
+typedef void (*mapped_f)(struct perf_event *event, struct mm_struct *mm);
+
+#define get_mapped(event, func)			\
+({	struct pmu *pmu;			\
+	mapped_f f = NULL;			\
+	guard(rcu)();				\
+	pmu = READ_ONCE(event->pmu);		\
+	if (pmu)				\
+		f = pmu->func;			\
+	f;					\
+})
+
 static void perf_mmap_open(struct vm_area_struct *vma)
 {
 	struct perf_event *event = vma->vm_file->private_data;
+	mapped_f mapped = get_mapped(event, event_mapped);
 
 	atomic_inc(&event->mmap_count);
 	atomic_inc(&event->rb->mmap_count);
@@ -6442,8 +6455,8 @@ static void perf_mmap_open(struct vm_are
 	if (vma->vm_pgoff)
 		atomic_inc(&event->rb->aux_mmap_count);
 
-	if (event->pmu && event->pmu->event_mapped)
-		event->pmu->event_mapped(event, vma->vm_mm);
+	if (mapped)
+		mapped(event, vma->vm_mm);
 }
 
 static void perf_pmu_output_stop(struct perf_event *event);
@@ -6459,6 +6472,7 @@ static void perf_pmu_output_stop(struct
 static void perf_mmap_close(struct vm_area_struct *vma)
 {
 	struct perf_event *event = vma->vm_file->private_data;
+	mapped_f unmapped = get_mapped(event, event_unmapped);
 	struct perf_buffer *rb = ring_buffer_get(event);
 	struct user_struct *mmap_user = rb->mmap_user;
 	int mmap_locked = rb->mmap_locked;
@@ -6466,8 +6480,8 @@ static void perf_mmap_close(struct vm_ar
 	bool detach_rest = false;
 
 	/* FIXIES vs perf_pmu_unregister() */
-	if (event->pmu && event->pmu->event_unmapped)
-		event->pmu->event_unmapped(event, vma->vm_mm);
+	if (unmapped)
+		unmapped(event, vma->vm_mm);
 
 	/*
 	 * The AUX buffer is strictly a sub-buffer, serialize using aux_mutex
@@ -6660,6 +6674,7 @@ static int perf_mmap(struct file *file,
 	unsigned long nr_pages;
 	long user_extra = 0, extra = 0;
 	int ret, flags = 0;
+	mapped_f mapped;
 
 	/*
 	 * Don't allow mmap() of inherited per-task counters. This would
@@ -6878,8 +6893,9 @@ static int perf_mmap(struct file *file,
 	if (!ret)
 		ret = map_range(rb, vma);
 
-	if (!ret && event->pmu && event->pmu->event_mapped)
-		event->pmu->event_mapped(event, vma->vm_mm);
+	mapped = get_mapped(event, event_mapped);
+	if (mapped)
+		mapped(event, vma->vm_mm);
 
 	return ret;
 }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-10  6:42   ` Ravi Bangoria
@ 2025-02-12 12:49     ` Peter Zijlstra
  2025-02-13  7:52       ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-12 12:49 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang

On Mon, Feb 10, 2025 at 12:12:40PM +0530, Ravi Bangoria wrote:
> > @@ -13467,7 +13665,13 @@ perf_event_exit_event(struct perf_event
> >  		 * Kick perf_poll() for is_event_hup();
> >  		 */
> >  		perf_event_wakeup(parent_event);
> > -		free_event(event);
> > +		/*
> > +		 * pmu_detach_event() will have an extra refcount.
> > +		 */
> > +		if (revoke)
> > +			put_event(event);
> > +		else
> > +			free_event(event);
> >  		put_event(parent_event);
> >  		return;
> >  	}
> 
> There is a race between do_exit() and perf_pmu_unregister():
> 
> Assume: a Parent process with 'event1' and a Child process with an
> inherited 'event2'.
> 
>                   CPU 1                                                       CPU 2
> 
>       Child process exiting ...
>    1  do_exit()
>    2    perf_event_exit_task()
>    3      perf_event_exit_task_context()
>    4        mutex_lock(&child_ctx->mutex);
>    5        list_for_each_entry_safe(&child_ctx->event_list)
>    6        /* picks event2. event2->refcount is 1 */
>    7                                                             perf_pmu_unregister()
>    8                                                               pmu_detach_events()
>    9                                                                 pmu_get_event(pmu) /* Picks event2 */
>   10                                                                   atomic_long_inc_not_zero(&event->refcount) /* event2 */
>   11                                                                   /* event2->refcount becomes 2 */
>   12                                                                 pmu_detach_event() /* event2 */
>   13                                                                   perf_event_ctx_lock(); /* event2 */
>   14        /* event2->refcount became 2 (by CPU 2) */                 .
>   15      perf_event_exit_event() /* event2 */                         . /* CPU 1 is holding ctx->mutex of
>   16        if (parent_event) { /* true */                             .    child process. Waiting ... */
>   17          if (revoke) /* false */                                  .
>   18          else                                                     .
>   19              free_event();  /* event2 */                          .
>   20                  WARN()                                           .
> 
>                       ^^^^^^
> 
> This race manifests as:
> 
>   unexpected event refcount: 2; ptr=00000000c6ca2049
>   WARNING: CPU: 57 PID: 9729 at kernel/events/core.c:5439 free_event+0x48/0x60
>   RIP: 0010:free_event+0x48/0x60
>   Call Trace:
>    ? free_event+0x48/0x60
>    perf_event_exit_event.isra.0+0x60/0xf0
>    perf_event_exit_task+0x1e1/0x280
>    do_exit+0x327/0xb00
> 
> The race continues... now, with the stale child2->parent pointer:
> 
>                   CPU 1                                                       CPU 2
> 
>   15      perf_event_exit_event() /* event2 */                         . /* CPU 1 is holding ctx->mutex of
>   16        if (parent_event) { /* true */                             .    child process. Waiting ... */
>   17          if (revoke) /* false */                                  .
>   18          else                                                     .
>   19              free_event();  /* event2 */                          .
>   20                  WARN()                                           .
>   21          put_event(parent_event); /* event1 */                    .
>   22          /* event1->refcount becomes 1 */                         .
>   23        }                                                          .
>   24        +- mutex_unlock(&child_ctx->mutex);                        . /* Acquired child's ctx->mutex */
>   25                                                                   __pmu_detach_event() /* event2 */
>   26                                                                     perf_event_exit_event() /* event2 */
>   27                                                                       if (parent_event) { /* true, BUT FALSE POSITIVE. */
>   28                                                                         if (revoke) /* true */
>   29                                                                             put_event(); /* event2 */
>   30                                                                             /* event2->refcount becomes 1 */
>   31                                                                         put_event(parent_event); /* event1 */
>   32                                                                         /* event1->refcount becomes 0 */
> 
>                                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> This can manifest as some random bug. I guess, perf_event_exit_event()
> should also check for (event->attach_state & PERF_ATTACH_CHILD) when
> event->parent != NULL.

Does this work?

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2303,6 +2303,7 @@ static void perf_child_detach(struct per
 
 	sync_child_event(event);
 	list_del_init(&event->child_list);
+	event->parent = NULL;
 }
 
 static bool is_orphaned_event(struct perf_event *event)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-12 12:49     ` Peter Zijlstra
@ 2025-02-13  7:52       ` Ravi Bangoria
  2025-02-13 13:08         ` Peter Zijlstra
  2025-02-14 20:24         ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-13  7:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang, Ravi Bangoria

> Does this work?
> 
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2303,6 +2303,7 @@ static void perf_child_detach(struct per
>  
>  	sync_child_event(event);
>  	list_del_init(&event->child_list);
> +	event->parent = NULL;
>  }
>  
>  static bool is_orphaned_event(struct perf_event *event)

Apparently not, it ends up with:

  ------------[ cut here ]------------
  WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0
  Modules linked in: [...]
  CPU: 145 UID: 1002 PID: 5459 Comm: perf_fuzzer Kdump: loaded Not tainted 6.14.0-rc1-pmu-unregister+ #244
  Hardware name: AMD Corporation RUBY/RUBY, BIOS RRR1009C 07/21/2023
  RIP: 0010:event_function+0xd2/0xf0
  Code: [...]
  RSP: 0018:ffa000000647fac8 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ff1100100d4b0ee0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffa000000647faf0 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffa000000647fbb8
  R13: 0000000000000000 R14: ff11000104cc9e00 R15: ff110001216fcec0
  FS:  000070acabcae740(0000) GS:ff1100100d480000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffff600000 CR3: 000000011d920005 CR4: 0000000000f71ff0
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? show_regs+0x6c/0x80
   ? __warn+0x8d/0x150
   ? event_function+0xd2/0xf0
   ? report_bug+0x182/0x1b0
   ? handle_bug+0x6e/0xb0
   ? exc_invalid_op+0x18/0x80
   ? asm_exc_invalid_op+0x1b/0x20
   ? event_function+0xd2/0xf0
   ? event_function+0x3d/0xf0
   remote_function+0x4f/0x70
   ? __pfx_remote_function+0x10/0x10
   generic_exec_single+0x7f/0x160
   smp_call_function_single+0x110/0x160
   ? __pfx_remote_function+0x10/0x10
   ? ktime_get_ts64+0x49/0x150
   event_function_call+0x98/0x1d0
   ? __pfx___perf_event_disable+0x10/0x10
   ? __pfx___perf_event_disable+0x10/0x10
   ? __pfx_event_function+0x10/0x10
   ? __pfx__perf_event_disable+0x10/0x10
   _perf_event_disable+0x41/0x70
   perf_event_for_each_child+0x40/0x90
   ? __pfx__perf_event_disable+0x10/0x10
   _perf_ioctl+0xac/0xb00
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? place_entity+0xa7/0x170
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __x2apic_send_IPI_dest+0x32/0x50
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? x2apic_send_IPI+0x26/0x40
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? native_send_call_func_single_ipi+0x13/0x20
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __smp_call_single_queue+0xf7/0x180
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? generic_exec_single+0x38/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? smp_call_function_single_async+0x22/0x50
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? mutex_lock+0x12/0x50
   perf_ioctl+0x45/0x80
   __x64_sys_ioctl+0xa7/0xe0
   x64_sys_call+0x131e/0x2650
   do_syscall_64+0x7e/0x170
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? rcu_core+0x1d1/0x3a0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? rcu_core_si+0xe/0x20
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? handle_softirqs+0xe7/0x330
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? irqentry_exit_to_user_mode+0x43/0x250
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? irqentry_exit+0x43/0x50
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? sysvec_apic_timer_interrupt+0x4f/0xc0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x70acabb24ded
  Code: [...]
  RSP: 002b:00007fffb3dd4a40 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fffb3dd6f28 RCX: 000070acabb24ded
  RDX: 0000000066f26a61 RSI: 0000000000002401 RDI: 0000000000000013
  RBP: 00007fffb3dd4a90 R08: 000070acabc03084 R09: 000070acabc030a0
  R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
  R13: 0000000000000000 R14: 00005e38a9ec5b60 R15: 000070acabd00000
   </TASK>
  ---[ end trace 0000000000000000 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:286 event_function+0xd6/0xf0
  Modules linked in: [...]
  CPU: 145 UID: 1002 PID: 5459 Comm: perf_fuzzer Kdump: loaded Tainted: G        W          6.14.0-rc1-pmu-unregister+ #244
  Tainted: [W]=WARN
  Hardware name: AMD Corporation RUBY/RUBY, BIOS RRR1009C 07/21/2023
  RIP: 0010:event_function+0xd6/0xf0
  Code: [...]
  RSP: 0018:ffa000000647fac8 EFLAGS: 00010086
  RAX: 0000000000000000 RBX: ff1100100d4b0ee0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffa000000647faf0 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffa000000647fbb8
  R13: 0000000000000000 R14: ff11000104cc9e00 R15: ff110001216fcec0
  FS:  000070acabcae740(0000) GS:ff1100100d480000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffff600000 CR3: 000000011d920005 CR4: 0000000000f71ff0
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? show_regs+0x6c/0x80
   ? __warn+0x8d/0x150
   ? event_function+0xd6/0xf0
   ? report_bug+0x182/0x1b0
   ? handle_bug+0x6e/0xb0
   ? exc_invalid_op+0x18/0x80
   ? asm_exc_invalid_op+0x1b/0x20
   ? event_function+0xd6/0xf0
   ? event_function+0x3d/0xf0
   remote_function+0x4f/0x70
   ? __pfx_remote_function+0x10/0x10
   generic_exec_single+0x7f/0x160
   smp_call_function_single+0x110/0x160
   ? __pfx_remote_function+0x10/0x10
   ? ktime_get_ts64+0x49/0x150
   event_function_call+0x98/0x1d0
   ? __pfx___perf_event_disable+0x10/0x10
   ? __pfx___perf_event_disable+0x10/0x10
   ? __pfx_event_function+0x10/0x10
   ? __pfx__perf_event_disable+0x10/0x10
   _perf_event_disable+0x41/0x70
   perf_event_for_each_child+0x40/0x90
   ? __pfx__perf_event_disable+0x10/0x10
   _perf_ioctl+0xac/0xb00
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? place_entity+0xa7/0x170
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __x2apic_send_IPI_dest+0x32/0x50
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? x2apic_send_IPI+0x26/0x40
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? native_send_call_func_single_ipi+0x13/0x20
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __smp_call_single_queue+0xf7/0x180
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? generic_exec_single+0x38/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? smp_call_function_single_async+0x22/0x50
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? mutex_lock+0x12/0x50
   perf_ioctl+0x45/0x80
   __x64_sys_ioctl+0xa7/0xe0
   x64_sys_call+0x131e/0x2650
   do_syscall_64+0x7e/0x170
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? rcu_core+0x1d1/0x3a0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? rcu_core_si+0xe/0x20
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? handle_softirqs+0xe7/0x330
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? irqentry_exit_to_user_mode+0x43/0x250
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? irqentry_exit+0x43/0x50
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? sysvec_apic_timer_interrupt+0x4f/0xc0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x70acabb24ded
  Code: [...]
  RSP: 002b:00007fffb3dd4a40 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fffb3dd6f28 RCX: 000070acabb24ded
  RDX: 0000000066f26a61 RSI: 0000000000002401 RDI: 0000000000000013
  RBP: 00007fffb3dd4a90 R08: 000070acabc03084 R09: 000070acabc030a0
  R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
  R13: 0000000000000000 R14: 00005e38a9ec5b60 R15: 000070acabd00000
   </TASK>
  ---[ end trace 0000000000000000 ]---
  watchdog: BUG: soft lockup - CPU#88 stuck for 26s! [perf_fuzzer:5740]
  Modules linked in: [...]
  CPU: 88 UID: 1002 PID: 5740 Comm: perf_fuzzer Kdump: loaded Tainted: G        W          6.14.0-rc1-pmu-unregister+ #244
  Tainted: [W]=WARN
  Hardware name: AMD Corporation RUBY/RUBY, BIOS RRR1009C 07/21/2023
  RIP: 0010:smp_call_function_single+0x11f/0x160
  Code: [...]
  RSP: 0000:ffa0000004b57ba0 EFLAGS: 00000202
  RAX: 0000000000000000 RBX: ff11000153ad8000 RCX: 0000000000000000
  RDX: 0000000000000011 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffa0000004b57be8 R08: 0000000000000000 R09: ffa0000004b57c20
  R10: 0000000000000001 R11: ffffffff81621ef0 R12: ff11000104cc9e00
  R13: ffa0000004b57c08 R14: ff1100013732af40 R15: ff1100100b830ee0
  FS:  0000000000000000(0000) GS:ff1100100b800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00005e38a9ea8930 CR3: 0000000003440004 CR4: 0000000000f71ef0
  PKRU: 55555554
  Call Trace:
   <IRQ>
   ? show_regs+0x6c/0x80
   ? watchdog_timer_fn+0x22b/0x2d0
   ? __pfx_watchdog_timer_fn+0x10/0x10
   ? __hrtimer_run_queues+0x112/0x2a0
   ? clockevents_program_event+0xc1/0x150
   ? hrtimer_interrupt+0xfd/0x260
   ? __sysvec_apic_timer_interrupt+0x56/0x130
   ? sysvec_apic_timer_interrupt+0x93/0xc0
   </IRQ>
   <TASK>
   ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
   ? __pfx_remote_function+0x10/0x10
   ? smp_call_function_single+0x11f/0x160
   ? __pfx_remote_function+0x10/0x10
   ? event_function_call+0x115/0x1d0
   ? srso_alias_return_thunk+0x5/0xfbef5
   event_function_call+0x98/0x1d0
   ? __pfx___perf_remove_from_context+0x10/0x10
   ? __pfx___perf_remove_from_context+0x10/0x10
   ? __pfx_event_function+0x10/0x10
   perf_remove_from_context+0x46/0xa0
   perf_event_release_kernel+0x1f3/0x210
   perf_release+0x12/0x20
   __fput+0xed/0x2d0
   ____fput+0x15/0x20
   task_work_run+0x60/0xa0
   do_exit+0x317/0xb00
   ? srso_alias_return_thunk+0x5/0xfbef5
   do_group_exit+0x34/0x90
   get_signal+0x934/0x940
   arch_do_signal_or_restart+0x2f/0x250
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? event_function+0xa4/0xf0
   ? __pfx_remote_function+0x10/0x10
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? remote_function+0x4f/0x70
   ? srso_alias_return_thunk+0x5/0xfbef5
   irqentry_exit_to_user_mode+0x1e0/0x250
   irqentry_exit+0x43/0x50
   sysvec_reschedule_ipi+0x5d/0x100
   asm_sysvec_reschedule_ipi+0x1b/0x20
  RIP: 0033:0x5e38a9eb87dd
  Code: Unable to access opcode bytes at 0x5e38a9eb87b3.
  RSP: 002b:00007fffb3dd4ae8 EFLAGS: 00000202
  RAX: 0000000000001553 RBX: 00007fffb3dd6f28 RCX: 00000000000020da
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
  RBP: 00007fffb3dd4b00 R08: 0000000000000001 R09: 000070acabcae740
  R10: 000070acabd010b8 R11: 0000000000000246 R12: 0000000000000001
  R13: 0000000000000000 R14: 00005e38a9ec5b60 R15: 000070acabd00000
   </TASK>


Something like below instead? I haven't tested it thoroughly though.

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d2b87a425e75..4e131b1c37ad 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13645,20 +13645,25 @@ perf_event_exit_event(struct perf_event *event,
 	unsigned long detach_flags = DETACH_EXIT;
 
 	if (parent_event) {
-		/*
-		 * Do not destroy the 'original' grouping; because of the
-		 * context switch optimization the original events could've
-		 * ended up in a random child task.
-		 *
-		 * If we were to destroy the original group, all group related
-		 * operations would cease to function properly after this
-		 * random child dies.
-		 *
-		 * Do destroy all inherited groups, we don't care about those
-		 * and being thorough is better.
-		 */
-		detach_flags |= DETACH_GROUP | DETACH_CHILD;
 		mutex_lock(&parent_event->child_mutex);
+		if (event->attach_state & PERF_ATTACH_CHILD) {
+			/*
+			 * Do not destroy the 'original' grouping; because of the
+			 * context switch optimization the original events could've
+			 * ended up in a random child task.
+			 *
+			 * If we were to destroy the original group, all group related
+			 * operations would cease to function properly after this
+			 * random child dies.
+			 *
+			 * Do destroy all inherited groups, we don't care about those
+			 * and being thorough is better.
+			 */
+			detach_flags |= DETACH_GROUP | DETACH_CHILD;
+		} else {
+			mutex_unlock(&parent_event->child_mutex);
+			parent_event = NULL;
+		}
 	}
 
 	if (revoke)
---

Thanks,
Ravi

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-10  6:59   ` Ravi Bangoria
@ 2025-02-13 13:07     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-13 13:07 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang

On Mon, Feb 10, 2025 at 12:29:21PM +0530, Ravi Bangoria wrote:
> On 05-Feb-25 3:51 PM, Peter Zijlstra wrote:
> > Previously it was only safe to call perf_pmu_unregister() if there
> > were no active events of that pmu around -- which was impossible to
> > guarantee since it races all sorts against perf_init_event().
> > 
> > Rework the whole thing by:
> > 
> >  - keeping track of all events for a given pmu
> > 
> >  - 'hiding' the pmu from perf_init_event()
> > 
> >  - waiting for the appropriate (s)rcu grace periods such that all
> >    prior references to the PMU will be completed
> > 
> >  - detaching all still existing events of that pmu (see first point)
> >    and moving them to a new REVOKED state.
> > 
> >  - actually freeing the pmu data.
> > 
> > Where notably the new REVOKED state must inhibit all event actions
> > from reaching code that wants to use event->pmu.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Another race between perf_event_init_task() failure path and
> perf_pmu_unregister():
> 
>         CPU 1                                          CPU 2
> 
>   perf_event_init_task()
>     perf_event_free_task()
>       perf_free_event(event1)
>       /* event1->refcount is 1 */
>                                                   perf_pmu_unregister()
>                                                     pmu_detach_events()
>                                                       pmu_get_event(pmu) /* Picks event1 */
>                                                         atomic_long_inc_not_zero(&event1->refcount) /* event1 */
>       /* event1->refcount became 2 (by CPU 2) */
>         free_event(event1)
>           WARN()

I've unified perf_event_free_task() and perf_event_exit_task(). This
makes both use perf_event_exit_event() and as such should no longer have
this free_event().




* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-13  7:52       ` Ravi Bangoria
@ 2025-02-13 13:08         ` Peter Zijlstra
  2025-02-14  3:57           ` Ravi Bangoria
  2025-02-14 20:24         ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-13 13:08 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang

On Thu, Feb 13, 2025 at 01:22:55PM +0530, Ravi Bangoria wrote:
> > Does this work?
> > 
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2303,6 +2303,7 @@ static void perf_child_detach(struct per
> >  
> >  	sync_child_event(event);
> >  	list_del_init(&event->child_list);
> > +	event->parent = NULL;
> >  }
> >  
> >  static bool is_orphaned_event(struct perf_event *event)
> 
> Apparently not, it ends up with:
> 
>   ------------[ cut here ]------------
>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0

Durr, do you have an updated test case?

> Something like below instead? I haven't tested it thoroughly though.
> 
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d2b87a425e75..4e131b1c37ad 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -13645,20 +13645,25 @@ perf_event_exit_event(struct perf_event *event,
>  	unsigned long detach_flags = DETACH_EXIT;
>  
>  	if (parent_event) {
> -		/*
> -		 * Do not destroy the 'original' grouping; because of the
> -		 * context switch optimization the original events could've
> -		 * ended up in a random child task.
> -		 *
> -		 * If we were to destroy the original group, all group related
> -		 * operations would cease to function properly after this
> -		 * random child dies.
> -		 *
> -		 * Do destroy all inherited groups, we don't care about those
> -		 * and being thorough is better.
> -		 */
> -		detach_flags |= DETACH_GROUP | DETACH_CHILD;
>  		mutex_lock(&parent_event->child_mutex);
> +		if (event->attach_state & PERF_ATTACH_CHILD) {
> +			/*
> +			 * Do not destroy the 'original' grouping; because of the
> +			 * context switch optimization the original events could've
> +			 * ended up in a random child task.
> +			 *
> +			 * If we were to destroy the original group, all group related
> +			 * operations would cease to function properly after this
> +			 * random child dies.
> +			 *
> +			 * Do destroy all inherited groups, we don't care about those
> +			 * and being thorough is better.
> +			 */
> +			detach_flags |= DETACH_GROUP | DETACH_CHILD;
> +		} else {
> +			mutex_unlock(&parent_event->child_mutex);
> +			parent_event = NULL;
> +		}
>  	}

Yeah, that might do, but it's not really nice. But perhaps it's the best we
can do. I'll give it some thought.


* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-13 13:08         ` Peter Zijlstra
@ 2025-02-14  3:57           ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-14  3:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang, Ravi Bangoria

On 13-Feb-25 6:38 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2025 at 01:22:55PM +0530, Ravi Bangoria wrote:
>>> Does this work?
>>>
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -2303,6 +2303,7 @@ static void perf_child_detach(struct per
>>>  
>>>  	sync_child_event(event);
>>>  	list_del_init(&event->child_list);
>>> +	event->parent = NULL;
>>>  }
>>>  
>>>  static bool is_orphaned_event(struct perf_event *event)
>>
>> Apparently not, it ends up with:
>>
>>   ------------[ cut here ]------------
>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0
> 
> Durr, do you have an updated test case?

This one triggered with just perf_fuzzer run (without my tinypmu test).

But in general, to test the whole series, I ran perf_fuzzer simultaneously
with my tinypmu tests:

  term1~$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid; while true; do ./perf_fuzzer; done
  term2~$ cd ~/tinypmu; make clean; make; while true; do sudo bash tinypmu-u.sh; done

Thanks,
Ravi


* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-13  7:52       ` Ravi Bangoria
  2025-02-13 13:08         ` Peter Zijlstra
@ 2025-02-14 20:24         ` Peter Zijlstra
  2025-02-17  8:24           ` Ravi Bangoria
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-14 20:24 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: mingo, lucas.demarchi, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang

On Thu, Feb 13, 2025 at 01:22:55PM +0530, Ravi Bangoria wrote:
> Apparently not, it ends up with:
> 
>   ------------[ cut here ]------------
>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0
>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:286 event_function+0xd6/0xf0

>    remote_function+0x4f/0x70
>    generic_exec_single+0x7f/0x160
>    smp_call_function_single+0x110/0x160
>    event_function_call+0x98/0x1d0
>    _perf_event_disable+0x41/0x70
>    perf_event_for_each_child+0x40/0x90
>    _perf_ioctl+0xac/0xb00
>    perf_ioctl+0x45/0x80

Took me a long while trying to blame this on the 'event->parent =
NULL;', but AFAICT this is a new, unrelated issue.

What I think happens is this perf_ioctl(DISABLE) vs pmu_detach_events()
race, where the crux is that perf_ioctl() path does not take
event2->mutex which allows the following interleave:


	   event1  <---> ctx1
	    |  ^
 child_list |  | parent
	    v  |
	   event2  <---> ctx2


perf_ioctl()
  perf_event_ctx_lock(event1)
    get_ctx(ctx1)
    mutex_lock(ctx1->mutex)
  _perf_ioctl()
    perf_event_for_each_child()
     mutex_lock(event1->child_mutex)
     _perf_event_disable(event1)
     _perf_event_disable(event2)
       raw_spin_lock_irq(ctx2->lock)
       raw_spin_unlock_irq()
       event_function_call(event2, __perf_event_disable)
         task_function_call()
           <IPI __perf_event_disable>


pmu_detach_events()
  event2 = pmu_get_event() <-- inc(event2->refcount);
  pmu_detach_event(event2)
    perf_event_ctx_lock(event2)
      get_ctx(ctx2)
      mutex_lock(ctx2->mutex)
    __pmu_detach_event()
      perf_event_exit_event()
        mutex_lock(event1->child_mutex)
        perf_remove_from_context(event2, EXIT|GROUP|CHILD|REVOKE)
          lockdep_assert_held(ctx2->mutex)
          raw_spin_lock_irq(ctx2->lock)
          raw_spin_unlock_irq(ctx2->lock)
          event_function_call(event2, __perf_remove_from_context)
            task_function_call(event_function)
	      <IPI __perf_remove_from_context>
                remote_function()
                  event_function()
                    perf_ctx_lock(cpuctx, ctx2)
                      raw_spin_lock(ctx2->lock)
                    __perf_remove_from_context(event2)
                    event_sched_out()
                    perf_group_detach()
                    perf_child_detach()
                    list_del_event()
                    event->state = REVOKED;
                    cpc->task_epc = NULL; // event2 is last
                    ctx->is_active = 0;         <--.
                    cpuctx->task_ctx = NULL;       |
                                                   |
                                                   |
           <IPI __perf_event_disable>              |
             remote_function()                     |
               event_function()                    |
                 perf_ctx_lock(cpuctx, ctx2)       |
                   raw_spin_lock(ctx2->lock)       |
                                                   |
                 WARN_ON_ONCE(!ctx2->is_active)   -'
                 WARN_ON_ONCE(cpuctx->task_ctx != ctx2)


Still trying to work out how best to avoid this.


* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-14 20:24         ` Peter Zijlstra
@ 2025-02-17  8:24           ` Ravi Bangoria
  2025-02-17 16:31             ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-17  8:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@kernel.org, lucas.demarchi@intel.com,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com, Bangoria, Ravikumar

Hi Peter,

>> Apparently not, it ends up with:
>>
>>   ------------[ cut here ]------------
>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0
>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:286 event_function+0xd6/0xf0
> 
>>    remote_function+0x4f/0x70
>>    generic_exec_single+0x7f/0x160
>>    smp_call_function_single+0x110/0x160
>>    event_function_call+0x98/0x1d0
>>    _perf_event_disable+0x41/0x70
>>    perf_event_for_each_child+0x40/0x90
>>    _perf_ioctl+0xac/0xb00
>>    perf_ioctl+0x45/0x80
> 
> Took me a long while trying to blame this on the 'event->parent =
> NULL;', but AFAICT this is a new, unrelated issue.
> 
> What I think happens is this perf_ioctl(DISABLE) vs pmu_detach_events()
> race, where the crux is that perf_ioctl() path does not take
> event2->mutex which allows the following interleave:

This one was only with perf_fuzzer, so pmu_detach_events() code path was
not invoked.

Thanks,
Ravi


* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-17  8:24           ` Ravi Bangoria
@ 2025-02-17 16:31             ` Ravi Bangoria
  2025-02-19 13:23               ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-17 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@kernel.org, lucas.demarchi@intel.com,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com, Ravi Bangoria

On 17-Feb-25 1:54 PM, Ravi Bangoria wrote:
> Hi Peter,
> 
>>> Apparently not, it ends up with:
>>>
>>>   ------------[ cut here ]------------
>>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0
>>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:286 event_function+0xd6/0xf0
>>
>>>    remote_function+0x4f/0x70
>>>    generic_exec_single+0x7f/0x160
>>>    smp_call_function_single+0x110/0x160
>>>    event_function_call+0x98/0x1d0
>>>    _perf_event_disable+0x41/0x70
>>>    perf_event_for_each_child+0x40/0x90
>>>    _perf_ioctl+0xac/0xb00
>>>    perf_ioctl+0x45/0x80
>>
>> Took me a long while trying to blame this on the 'event->parent =
>> NULL;', but AFAICT this is a new, unrelated issue.
>>
>> What I think happens is this perf_ioctl(DISABLE) vs pmu_detach_events()
>> race, where the crux is that perf_ioctl() path does not take
>> event2->mutex which allows the following interleave:
> 
> This one was only with perf_fuzzer, so pmu_detach_events() code path was
> not invoked.

I think the issue is, unaccount_event() gets called for the child event
after the child is detached. Since event->parent is NULL, unaccount_event()
abruptly corrupts the perf_sched_work.

I haven't verified it. Will do it tomorrow.

Thanks,
Ravi


* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-17 16:31             ` Ravi Bangoria
@ 2025-02-19 13:23               ` Ravi Bangoria
  2025-02-19 14:30                 ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-19 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@kernel.org, lucas.demarchi@intel.com,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com, Ravi Bangoria

Hi Peter,

>>>> Apparently not, it ends up with:
>>>>
>>>>   ------------[ cut here ]------------
>>>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0
>>>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:286 event_function+0xd6/0xf0
>>>
>>>>    remote_function+0x4f/0x70
>>>>    generic_exec_single+0x7f/0x160
>>>>    smp_call_function_single+0x110/0x160
>>>>    event_function_call+0x98/0x1d0
>>>>    _perf_event_disable+0x41/0x70
>>>>    perf_event_for_each_child+0x40/0x90
>>>>    _perf_ioctl+0xac/0xb00
>>>>    perf_ioctl+0x45/0x80
>>>
>>> Took me a long while trying to blame this on the 'event->parent =
>>> NULL;', but AFAICT this is a new, unrelated issue.
>>>
>>> What I think happens is this perf_ioctl(DISABLE) vs pmu_detach_events()
>>> race, where the crux is that perf_ioctl() path does not take
>>> event2->mutex which allows the following interleave:
>>
>> This one was only with perf_fuzzer, so pmu_detach_events() code path was
>> not invoked.
> 
> I think the issue is, unaccount_event() gets called for the child event
> after the child is detached. Since event->parent is NULL, unaccount_event()
> abruptly corrupts the perf_sched_work.

A different manifestation of the same underlying issue (perf_sched_work
is getting corrupted because of event->parent = NULL):

  WARNING: CPU: 0 PID: 5799 at kernel/events/core.c:286 event_function+0xd6/0xf0
  CPU: 0 UID: 1002 PID: 5799 Comm: perf_fuzzer Kdump: loaded Not tainted 6.14.0-rc1-pmu-unregister+ #263
  RIP: 0010:event_function+0xd6/0xf0
  RSP: 0018:ffa0000006a87828 EFLAGS: 00010086
  RAX: 0000000000000007 RBX: ff11001008830ee0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffa0000006a87850 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffa0000006a87918
  R13: 0000000000000000 R14: ff1100010e347c00 R15: ff110001226c9a40
  FS:  0000000000000000(0000) GS:ff11001008800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000595fed7d3e12 CR3: 000000016954c005 CR4: 0000000000f71ef0
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? event_function+0xd6/0xf0
   ? event_function+0x3d/0xf0
   remote_function+0x4c/0x70
   generic_exec_single+0x7c/0x160
   smp_call_function_single+0x110/0x160
   event_function_call+0x98/0x1d0
   perf_remove_from_context+0x46/0xa0
   perf_event_release_kernel+0x1f3/0x210
   perf_release+0x12/0x20

With the above trace as reference, something like below is what is
happening. (Code paths are based on my understanding of traces, I've
not verified each and every bit :P)

  Assume:
  task1: event1
  task2 (child of task1): event2 (child of event1)
  task3: event3

              CPU 1                                             CPU 2                                                               CPU 3                                                       CPU 4
  
                                                  /* task3 sched in */
                                                  perf_event_task_sched_in()
                                                    if (static_branch_unlikely(&perf_sched_events)) /* true */
                                                        __perf_event_task_sched_in()
                                                            ...

  task2:
  perf_event_exit_event() /* event2 */
    event->parent = NULL /* by perf_child_detach() */
    free_event()
      _free_event()
        unaccount_event()
          if (event->parent) /* false */
              return; /* won't get executed */
          perf_sched_work <== OFF  ------.
                                         |
                                         |        /* task3 sched out */
                                         |        perf_event_task_sched_out()
                                         '----->   if (static_branch_unlikely(&perf_sched_events)) /* false */
                                         |             /* __perf_event_task_sched_out() won't get called */
                                         |
                                         |                                                                            /* task3 sched in */
                                         |                                                                            perf_event_task_sched_in()
                                         '-------------------------------------------------------------------------->   if (static_branch_unlikely(&perf_sched_events)) /* false */
                                                                                                                             /* __perf_event_task_sched_in() won't get called
                                                                                                                             So cpuctx->task_ctx will remains NULL; */
  
                                                                                                                                                                                         close(event3)
                                                                                                                                                                                           perf_release()
                                                                                                                                                                                             perf_event_release_kernel()
                                                                                                                                                                                               perf_remove_from_context()
                                                                                                                                                                                                 event_function_call()
                                                                                                                                                                                                 /* IPI to CPU 3 ---> */
                                                                                                                      /* ---> IPI from CPU 4 */
                                                                                                                      event_function()
                                                                                                                        if (ctx->task) {
                                                                                                                            /* task_ctx is NULL whereas ctx is !NULL */
                                                                                                                            WARN_ON_ONCE(task_ctx != ctx);

                                                                                                                            ^^^^^^^^^^^^

IIUC, event->parent is the only way to detect whether the event is a
child or not, so it makes sense to preserve the ->parent pointer even
after the child is detached.

Thanks,
Ravi


* Re: [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable
  2025-02-19 13:23               ` Ravi Bangoria
@ 2025-02-19 14:30                 ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2025-02-19 14:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@kernel.org, lucas.demarchi@intel.com,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com, Ravi Bangoria

>>>>> Apparently not, it ends up with:
>>>>>
>>>>>   ------------[ cut here ]------------
>>>>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:281 event_function+0xd2/0xf0
>>>>>   WARNING: CPU: 145 PID: 5459 at kernel/events/core.c:286 event_function+0xd6/0xf0
>>>>
>>>>>    remote_function+0x4f/0x70
>>>>>    generic_exec_single+0x7f/0x160
>>>>>    smp_call_function_single+0x110/0x160
>>>>>    event_function_call+0x98/0x1d0
>>>>>    _perf_event_disable+0x41/0x70
>>>>>    perf_event_for_each_child+0x40/0x90
>>>>>    _perf_ioctl+0xac/0xb00
>>>>>    perf_ioctl+0x45/0x80
>>>>
>>>> Took me a long while trying to blame this on the 'event->parent =
>>>> NULL;', but AFAICT this is a new, unrelated issue.
>>>>
>>>> What I think happens is this perf_ioctl(DISABLE) vs pmu_detach_events()
>>>> race, where the crux is that perf_ioctl() path does not take
>>>> event2->mutex which allows the following interleave:
>>>
>>> This one was only with perf_fuzzer, so pmu_detach_events() code path was
>>> not invoked.
>>
>> I think the issue is, unaccount_event() gets called for the child event
>> after the child is detached. Since event->parent is NULL, unaccount_event()
>> abruptly corrupts the perf_sched_work.
> 
> A different manifestation of the same underlying issue (perf_sched_work
> is getting corrupted because of event->parent = NULL):

Duh! I meant perf_sched_events gets corrupted by invoking perf_sched_work.

Thanks,
Ravi


* Re: [PATCH v2 06/24] perf: Fix pmus_lock vs pmus_srcu ordering
  2025-02-05 10:21 ` [PATCH v2 06/24] perf: Fix pmus_lock vs pmus_srcu ordering Peter Zijlstra
@ 2025-02-27 16:59   ` Lucas De Marchi
  0 siblings, 0 replies; 47+ messages in thread
From: Lucas De Marchi @ 2025-02-27 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, ravi.bangoria, linux-kernel, willy, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	kan.liang, intel-xe

On Wed, Feb 05, 2025 at 11:21:26AM +0100, Peter Zijlstra wrote:
>Commit a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order")
>placed pmus_lock inside pmus_srcu, this makes perf_pmu_unregister()
>trip lockdep.
>
>Move the locking about such that only pmu_idr and pmus (list) are

s/about/around/ ?

>modified while holding pmus_lock. This avoids doing synchronize_srcu()
>while holding pmus_lock and all is well again.
>
>Fixes: a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order")
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Without this patch, I occasionally see the locking issue below, but it
doesn't always happen. I'm running with this patch and haven't seen the
issue since.

[  268.195731] ======================================================
[  268.201945] WARNING: possible circular locking dependency detected
[  268.208165] 6.14.0-rc4-xe+ #59 Tainted: G     U
[  268.213777] ------------------------------------------------------
[  268.219988] kmod-unbind/2509 is trying to acquire lock:
[  268.225238] ffffffff8b40b310 (&pmus_srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x21/0x270
[  268.233374]
                but task is already holding lock:
[  268.239241] ffffffff86381de8 (pmus_lock){+.+.}-{3:3}, at: perf_pmu_unregister+0x25/0x2f0
[  268.247372]
                which lock already depends on the new lock.

[  268.255577]
                the existing dependency chain (in reverse order) is:
[  268.263101]
                -> #1 (pmus_lock){+.+.}-{3:3}:
[  268.268717]        __mutex_lock+0x1a4/0x1300
[  268.273019]        mutex_lock_nested+0x1b/0x30
[  268.277489]        perf_swevent_init+0x15d/0x610
[  268.282142]        perf_try_init_event+0x111/0xba0
[  268.286971]        perf_event_alloc+0x1024/0x3eb0
[  268.291703]        __do_sys_perf_event_open+0x3b0/0x2460
[  268.297046]        __x64_sys_perf_event_open+0xbe/0x160
[  268.302305]        x64_sys_call+0x1eb0/0x2650
[  268.306689]        do_syscall_64+0x91/0x180
[  268.310918]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  268.316524]
                -> #0 (&pmus_srcu){.+.+}-{0:0}:
[  268.322215]        __lock_acquire+0x3339/0x6980
[  268.326777]        lock_sync+0xec/0x180
[  268.330645]        __synchronize_srcu+0xaf/0x270
[  268.335291]        synchronize_srcu+0x64/0x2b0
[  268.339776]        perf_pmu_unregister+0xed/0x2f0
[  268.344523]        xe_pmu_unregister+0x94/0xf0 [xe]
[  268.349604]        devm_action_release+0x49/0x80
[  268.354257]        devres_release_all+0x180/0x240
[  268.358996]        device_unbind_cleanup+0x1b/0x1b0
[  268.363910]        device_release_driver_internal+0x44a/0x5b0
[  268.369693]        device_driver_detach+0x36/0x50
[  268.374435]        unbind_store+0xe0/0x100
[  268.378571]        drv_attr_store+0x6a/0xc0
[  268.382782]        sysfs_kf_write+0x106/0x160
[  268.387176]        kernfs_fop_write_iter+0x359/0x550
[  268.392180]        vfs_write+0x64c/0x1030
[  268.396234]        ksys_write+0x11a/0x240
[  268.400284]        __x64_sys_write+0x72/0xc0
[  268.404587]        x64_sys_call+0x2c4/0x2650
[  268.408885]        do_syscall_64+0x91/0x180
[  268.413102]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  268.418708]
                other info that might help us debug this:

[  268.426746]  Possible unsafe locking scenario:

[  268.432715]        CPU0                    CPU1
[  268.437277]        ----                    ----
[  268.441842]   lock(pmus_lock);
[  268.444932]                                lock(&pmus_srcu);
[  268.450633]                                lock(pmus_lock);
[  268.456243]   sync(&pmus_srcu);
[  268.459419]
                 *** DEADLOCK ***

[  268.465375] 4 locks held by kmod-unbind/2509:
[  268.469779]  #0: ffff888135689420 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0x11a/0x240
[  268.477911]  #1: ffff8881107aee88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x221/0x550
[  268.486829]  #2: ffff888111cc41b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0xa2/0x5b0
[  268.496547]  #3: ffffffff86381de8 (pmus_lock){+.+.}-{3:3}, at: perf_pmu_unregister+0x25/0x2f0


Tested-by: Lucas De Marchi <lucas.demarchi@intel.com>

Lucas De Marchi

>---
> kernel/events/core.c |    4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>--- a/kernel/events/core.c
>+++ b/kernel/events/core.c
>@@ -11836,6 +11836,8 @@ void perf_pmu_unregister(struct pmu *pmu
> {
> 	mutex_lock(&pmus_lock);
> 	list_del_rcu(&pmu->entry);
>+	idr_remove(&pmu_idr, pmu->type);
>+	mutex_unlock(&pmus_lock);
>
> 	/*
> 	 * We dereference the pmu list under both SRCU and regular RCU, so
>@@ -11845,7 +11847,6 @@ void perf_pmu_unregister(struct pmu *pmu
> 	synchronize_rcu();
>
> 	free_percpu(pmu->pmu_disable_count);
>-	idr_remove(&pmu_idr, pmu->type);
> 	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
> 		if (pmu->nr_addr_filters)
> 			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
>@@ -11853,7 +11854,6 @@ void perf_pmu_unregister(struct pmu *pmu
> 		put_device(pmu->dev);
> 	}
> 	free_pmu_context(pmu);
>-	mutex_unlock(&pmus_lock);
> }
> EXPORT_SYMBOL_GPL(perf_pmu_unregister);
>
>
>


* Re: [PATCH v2 19/24] perf: Simplify perf_mmap() control flow
  2025-02-05 10:21 ` [PATCH v2 19/24] perf: Simplify perf_mmap() control flow Peter Zijlstra
@ 2025-03-03  5:39   ` Ravi Bangoria
  2025-03-03 11:19     ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-03-03  5:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	mingo, lucas.demarchi, Ravi Bangoria

Hi Peter,

Minor nit below:

> +			if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
> +				/*
> +				 * Raced against perf_mmap_close(); remove the
> +				 * event and try again.
> +				 */
> +				ring_buffer_attach(event, NULL);
> +				mutex_unlock(&event->mmap_mutex);
> +				goto again;
> +			}
> +

here ...

> +			rb = event->rb;
> +			goto unlock;
> +		}
> +
> +		user_extra = nr_pages + 1;
>  	} else {
>  		/*
>  		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
> @@ -6723,47 +6758,9 @@ static int perf_mmap(struct file *file,
>  
>  		atomic_set(&rb->aux_mmap_count, 1);
>  		user_extra = nr_pages;
> -
> -		goto accounting;
>  	}
>  
> -	/*
> -	 * If we have rb pages ensure they're a power-of-two number, so we
> -	 * can do bitmasks instead of modulo.
> -	 */
> -	if (nr_pages != 0 && !is_power_of_2(nr_pages))
> -		return -EINVAL;
> -
> -	if (vma_size != PAGE_SIZE * (1 + nr_pages))
> -		return -EINVAL;
> -
> -	WARN_ON_ONCE(event->ctx->parent_ctx);
> -again:
> -	mutex_lock(&event->mmap_mutex);
> -	if (event->rb) {
> -		if (data_page_nr(event->rb) != nr_pages) {
> -			ret = -EINVAL;
> -			goto unlock;
> -		}
> -
> -		if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
> -			/*
> -			 * Raced against perf_mmap_close(); remove the
> -			 * event and try again.
> -			 */
> -			ring_buffer_attach(event, NULL);
> -			mutex_unlock(&event->mmap_mutex);
> -			goto again;
> -		}
> -
>  		/* We need the rb to map pages. */

                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This comment should also go up. Keeping it here is misleading.

> -		rb = event->rb;
> -		goto unlock;
> -	}
> -
> -	user_extra = nr_pages + 1;
> -
> -accounting:
>  	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
>  
>  	/*

Thanks,
Ravi


* Re: [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable
  2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
                   ` (23 preceding siblings ...)
  2025-02-05 10:21 ` [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable Peter Zijlstra
@ 2025-03-03  6:01 ` Ravi Bangoria
  24 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2025-03-03  6:01 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: lucas.demarchi, linux-kernel, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	Ravi Bangoria

Ingo, Peter,

On 05-Feb-25 3:51 PM, Peter Zijlstra wrote:
> Hi
> 
> New version of the perf_pmu_unregister() rework. A fair number of bugs
> squashed.  Much thanks to Ravi for writing a stress tester. The robots also
> managed to find a few problems.

I've seen more issues with v2 as well, in addition to what I've already
reported (I didn't get time to debug those, so I haven't reported them
here). However, those additional issues appear only with the pmu
unregister stress test plus perf_fuzzer, and since pmu unregister is a
rare event, IMO they shouldn't block inclusion of the series. So, with
the current set of issues addressed:

Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>

Thanks,
Ravi


* Re: [PATCH v2 19/24] perf: Simplify perf_mmap() control flow
  2025-03-03  5:39   ` Ravi Bangoria
@ 2025-03-03 11:19     ` Ingo Molnar
  2025-03-03 13:36       ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2025-03-03 11:19 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Peter Zijlstra, linux-kernel, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	lucas.demarchi


* Ravi Bangoria <ravi.bangoria@amd.com> wrote:

> Hi Peter,
> 
> Minor nit below:
> 
> > +			if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
> > +				/*
> > +				 * Raced against perf_mmap_close(); remove the
> > +				 * event and try again.
> > +				 */
> > +				ring_buffer_attach(event, NULL);
> > +				mutex_unlock(&event->mmap_mutex);
> > +				goto again;
> > +			}
> > +
> 
> here ...
> 
> > +			rb = event->rb;
> > +			goto unlock;
> > +		}
> > +
> > +		user_extra = nr_pages + 1;
> >  	} else {
> >  		/*
> >  		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
> > @@ -6723,47 +6758,9 @@ static int perf_mmap(struct file *file,
> >  
> >  		atomic_set(&rb->aux_mmap_count, 1);
> >  		user_extra = nr_pages;
> > -
> > -		goto accounting;
> >  	}
> >  
> > -	/*
> > -	 * If we have rb pages ensure they're a power-of-two number, so we
> > -	 * can do bitmasks instead of modulo.
> > -	 */
> > -	if (nr_pages != 0 && !is_power_of_2(nr_pages))
> > -		return -EINVAL;
> > -
> > -	if (vma_size != PAGE_SIZE * (1 + nr_pages))
> > -		return -EINVAL;
> > -
> > -	WARN_ON_ONCE(event->ctx->parent_ctx);
> > -again:
> > -	mutex_lock(&event->mmap_mutex);
> > -	if (event->rb) {
> > -		if (data_page_nr(event->rb) != nr_pages) {
> > -			ret = -EINVAL;
> > -			goto unlock;
> > -		}
> > -
> > -		if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
> > -			/*
> > -			 * Raced against perf_mmap_close(); remove the
> > -			 * event and try again.
> > -			 */
> > -			ring_buffer_attach(event, NULL);
> > -			mutex_unlock(&event->mmap_mutex);
> > -			goto again;
> > -		}
> > -
> >  		/* We need the rb to map pages. */
> 
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This comment should also go up. Keeping it here is misleading.

Yeah, I did that in the conflict resolution (which conflicted in this 
very area), so the end result in tip:perf/core (or tip:master) should 
be fine as-is, correct?

Thanks,

	Ingo


* Re: [PATCH v2 19/24] perf: Simplify perf_mmap() control flow
  2025-03-03 11:19     ` Ingo Molnar
@ 2025-03-03 13:36       ` Ravi Bangoria
  2025-03-04  8:44         ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2025-03-03 13:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	lucas.demarchi, Ravi Bangoria

Hi Ingo,

>>> -
>>>  		/* We need the rb to map pages. */
>>
>>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> This comment should also go up. Keeping it here is misleading.
> 
> Yeah, I did that in the conflict resolution (which conflicted in this 
> very area), so the end result in tip:perf/core (or tip:master) should 
> be fine as-is, correct?

Yes, tip:perf/core looks good.

Thanks,
Ravi


* Re: [PATCH v2 19/24] perf: Simplify perf_mmap() control flow
  2025-03-03 13:36       ` Ravi Bangoria
@ 2025-03-04  8:44         ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2025-03-04  8:44 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Peter Zijlstra, linux-kernel, willy, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	lucas.demarchi


* Ravi Bangoria <ravi.bangoria@amd.com> wrote:

> Hi Ingo,
> 
> >>> -
> >>>  		/* We need the rb to map pages. */
> >>
> >>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> This comment should also go up. Keeping it here is misleading.
> > 
> > Yeah, I did that in the conflict resolution (which conflicted in this 
> > very area), so the end result in tip:perf/core (or tip:master) should 
> > be fine as-is, correct?
> 
> Yes, tip:perf/core looks good.

Great, thank you for checking!

I also picked up:

  bfd33e88addd perf/core: Fix perf_mmap() failure path

and have added your Reviewed-by tags to the series.

Thanks,

	Ingo


* [tip: perf/core] perf/core: Clean up perf_try_init_event()
  2025-02-05 10:21 ` [PATCH v2 08/24] perf: Cleanup perf_try_init_event() Peter Zijlstra
@ 2025-03-05 11:29   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-03-05 11:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Ingo Molnar, Ravi Bangoria, x86,
	linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     da02f54e81db2f7bf6af9d1d0cfc5b41ec6d0dcb
Gitweb:        https://git.kernel.org/tip/da02f54e81db2f7bf6af9d1d0cfc5b41ec6d0dcb
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 05 Feb 2025 11:21:28 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 05 Mar 2025 12:13:59 +01:00

perf/core: Clean up perf_try_init_event()

Make sure that perf_try_init_event() doesn't leave event->pmu nor
event->destroy set on failure.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250205102449.110145835@infradead.org
---
 kernel/events/core.c | 65 +++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b2334d2..f159dba 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12109,40 +12109,51 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	if (ctx)
 		perf_event_ctx_unlock(event->group_leader, ctx);
 
-	if (!ret) {
-		if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
-		    has_extended_regs(event))
-			ret = -EOPNOTSUPP;
+	if (ret)
+		goto err_pmu;
 
-		if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE &&
-		    event_has_any_exclude_flag(event))
-			ret = -EINVAL;
+	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
+	    has_extended_regs(event)) {
+		ret = -EOPNOTSUPP;
+		goto err_destroy;
+	}
 
-		if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) {
-			const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu);
-			struct cpumask *pmu_cpumask = perf_scope_cpumask(pmu->scope);
-			int cpu;
-
-			if (pmu_cpumask && cpumask) {
-				cpu = cpumask_any_and(pmu_cpumask, cpumask);
-				if (cpu >= nr_cpu_ids)
-					ret = -ENODEV;
-				else
-					event->event_caps |= PERF_EV_CAP_READ_SCOPE;
-			} else {
-				ret = -ENODEV;
-			}
-		}
+	if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE &&
+	    event_has_any_exclude_flag(event)) {
+		ret = -EINVAL;
+		goto err_destroy;
+	}
 
-		if (ret && event->destroy)
-			event->destroy(event);
+	if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) {
+		const struct cpumask *cpumask;
+		struct cpumask *pmu_cpumask;
+		int cpu;
+
+		cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu);
+		pmu_cpumask = perf_scope_cpumask(pmu->scope);
+
+		ret = -ENODEV;
+		if (!pmu_cpumask || !cpumask)
+			goto err_destroy;
+
+		cpu = cpumask_any_and(pmu_cpumask, cpumask);
+		if (cpu >= nr_cpu_ids)
+			goto err_destroy;
+
+		event->event_caps |= PERF_EV_CAP_READ_SCOPE;
 	}
 
-	if (ret) {
-		event->pmu = NULL;
-		module_put(pmu->module);
+	return 0;
+
+err_destroy:
+	if (event->destroy) {
+		event->destroy(event);
+		event->destroy = NULL;
 	}
 
+err_pmu:
+	event->pmu = NULL;
+	module_put(pmu->module);
 	return ret;
 }
 


end of thread, other threads:[~2025-03-05 11:29 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-05 10:21 [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 01/24] lockdep: Fix might_fault() Peter Zijlstra
2025-02-06 18:19   ` David Hildenbrand
2025-02-05 10:21 ` [PATCH v2 02/24] perf: Ensure bpf_perf_link path is properly serialized Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 03/24] perf: Simplify child event tear-down Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 04/24] perf: Simplify perf_event_free_task() wait Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 05/24] perf: Simplify perf_event_release_kernel() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 06/24] perf: Fix pmus_lock vs pmus_srcu ordering Peter Zijlstra
2025-02-27 16:59   ` Lucas De Marchi
2025-02-05 10:21 ` [PATCH v2 07/24] perf: Fix perf_pmu_register() vs perf_init_event() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 08/24] perf: Cleanup perf_try_init_event() Peter Zijlstra
2025-03-05 11:29   ` [tip: perf/core] perf/core: Clean up perf_try_init_event() tip-bot2 for Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 09/24] perf: Simplify perf_event_alloc() error path Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 10/24] perf: Simplify perf_pmu_register() " Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 11/24] perf: Simplify perf_pmu_register() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 12/24] perf: Simplify perf_init_event() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 13/24] perf: Simplify perf_event_alloc() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 14/24] perf: Merge pmu_disable_count into cpu_pmu_context Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 15/24] perf: Add this_cpc() helper Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 16/24] perf: Detach perf_cpu_pmu_context and pmu lifetimes Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 17/24] perf: Introduce perf_free_addr_filters() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 18/24] perf: Robustify perf_event_free_bpf_prog() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 19/24] perf: Simplify perf_mmap() control flow Peter Zijlstra
2025-03-03  5:39   ` Ravi Bangoria
2025-03-03 11:19     ` Ingo Molnar
2025-03-03 13:36       ` Ravi Bangoria
2025-03-04  8:44         ` Ingo Molnar
2025-02-05 10:21 ` [PATCH v2 20/24] perf: Fix perf_mmap() failure path Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 21/24] perf: Further simplify perf_mmap() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 22/24] perf: Remove retry loop from perf_mmap() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 23/24] perf: Lift event->mmap_mutex in perf_mmap() Peter Zijlstra
2025-02-05 10:21 ` [PATCH v2 24/24] perf: Make perf_pmu_unregister() useable Peter Zijlstra
2025-02-10  6:39   ` Ravi Bangoria
2025-02-11 15:46     ` Peter Zijlstra
2025-02-10  6:42   ` Ravi Bangoria
2025-02-12 12:49     ` Peter Zijlstra
2025-02-13  7:52       ` Ravi Bangoria
2025-02-13 13:08         ` Peter Zijlstra
2025-02-14  3:57           ` Ravi Bangoria
2025-02-14 20:24         ` Peter Zijlstra
2025-02-17  8:24           ` Ravi Bangoria
2025-02-17 16:31             ` Ravi Bangoria
2025-02-19 13:23               ` Ravi Bangoria
2025-02-19 14:30                 ` Ravi Bangoria
2025-02-10  6:59   ` Ravi Bangoria
2025-02-13 13:07     ` Peter Zijlstra
2025-03-03  6:01 ` [PATCH v2 00/24] perf: Make perf_pmu_unregister() usable Ravi Bangoria
